Perguntas e respostas¶

  1. É possível prever o score de felicidade de um país baseado nos dados de ambos os datasets? (o "Country name" não deve ser considerado para este modelo)

    Sim! A cada atributo numérico na base pode ser aplicado um modelo de regressão, e a cada atributo categórico, um modelo de classificação.
    O que poderia impedir a aplicação de algum método de estimação de valores seria a má qualidade dos dados. Para provar que é possível ou ressaltar os motivos de ser inviável, será desenvolvida uma análise da qualidade dos dados, e a manipulação dos mesmos numa tentativa de padronização para aplicação dos modelos.
    O tratamento dos modelos aplicados como forma de justificativa a esta resposta se encontram no tópico Q1 - Estimativa de Ladder score com base nos atributos

  2. É possível identificar a região do mundo em que um país se encontra através da relação entre as métricas e o score de felicidade obtido? Explique e justifique sua resposta;

    Sim! Como a região do mundo é um atributo categórico, podemos aplicar algum modelo de classificação para determiná-la.
    Mais uma vez, um impeditivo seria a qualidade/ausência dos dados.
    O tratamento dos modelos aplicados como forma de justificativa a esta resposta se encontram no tópico Q2 - Estimativa de Region com base nos atributos

  1. Os dados de 2020 e/ou 2021 sofreram algum impacto devido à pandemia?

    Sim!
    Durante manipulação dos dados foi gerado um gráfico considerando a presença de dados faltantes (Q3.a) e após a inputação de dados e descarte de casos singulares (Q3.b)
    Em ambos os casos podemos observar uma mudança no comportamento dos indicadores, com crescimento ou descrescimento acentuado em quase todos os casos, ou com uma mudança do comportamento que se observava há anos.

Metodologia¶

  • A execução desta análise seguiu as seguintes etapas:
    • 1 - Ingestão dos dados:
      • Buscar os dados em seus locais de armazenamento, concatenação de bases e criação do dataframe de base de trabalho com dados brutos
    • 2 - Análise e Manipulção dos dados:
      • Correção/adequação de tipos de dados, análise e tratamento prévio de dados faltantes, análise de casos singulares
    • 3 - Inputação de dados:
      • Análise profunda de campos vazios e determinação do melhor método de preenchimento
    • 4 - Análise de Regiões:
      • Análise profunda do comportamento do atributo Regional indicator, cruzamento com fontes de dados externas e determinação da melhor abordagem para imputação de dados
    • 5 - Modelagem e estimativa da Região dos países
      • Preparação dos dados, treinamento do modelo, teste, análise de resultados, ajuste de hiper-parâmetros, seleção de features, re-treino e resultados
    • 6 - Modelagem e estimativa do Ladder Score
      • Preparação dos dados, treinamento do modelo, teste, análise de resultados, ajuste de hiper-parâmetros, seleção de features, re-treino e resultados


  • Por falta de tempo, em todos os modelos vou ficar devendo a tunagem dos hiper-parâmetros

Bibliotecas¶

In [2]:
import pandas as pd
import os
import numpy as np
import joblib

from pandas_profiling import ProfileReport as pr
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()

from imblearn.over_sampling import SMOTE, SMOTENC

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import ExtraTreesRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, classification_report
from sklearn.metrics import recall_score, roc_auc_score, r2_score, accuracy_score
from sklearn.metrics import explained_variance_score, max_error, mean_squared_error
from catboost import CatBoostClassifier, CatBoostRegressor, Pool, metrics, cv

import warnings
warnings.filterwarnings('ignore') 



Ingestão¶

  • Foram adicionados ao conjunto de dados do problema, duas bases de dados com dados regionais dos países:

    • https://esa.un.org/MigFlows/Definition%20of%20regions.pdf
    • https://www.antwiki.org/wiki/Countries_by_Regions
  • O dataset regions_un é uma base gerada pela ONU

In [3]:
path = r"C:\jupyter notebooks\bornlogic\\"
In [4]:
input_dfs = dict()
for dirname, _, filenames in os.walk(path+"data"):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        input_dfs[filename.split('.')[0]] = pd.read_excel(os.path.join(dirname, filename))
C:\jupyter notebooks\bornlogic\\data\Data_2021.xls
C:\jupyter notebooks\bornlogic\\data\HistoricData.xls
C:\jupyter notebooks\bornlogic\\data\regions_antwiki.xls
C:\jupyter notebooks\bornlogic\\data\regions_un.xls
In [5]:
input_dfs.keys()
Out[5]:
dict_keys(['Data_2021', 'HistoricData', 'regions_antwiki', 'regions_un'])

Concatenação de dados¶

In [6]:
[*input_dfs['Data_2021'].columns]
Out[6]:
['Country name',
 'Regional indicator',
 'Ladder score',
 'Logged GDP per capita',
 'Social support',
 'Healthy life expectancy',
 'Freedom to make life choices',
 'Generosity',
 'Perceptions of corruption']
In [7]:
[*input_dfs['HistoricData'].columns]
Out[7]:
['Country name',
 'year',
 'Ladder score',
 'Logged GDP per capita',
 'Social support',
 'Healthy life expectancy',
 'Freedom to make life choices',
 'Generosity',
 'Perceptions of corruption',
 'Positive affect',
 'Negative affect']
In [8]:
input_dfs['HistoricData']['Regional indicator'] = None
input_dfs['HistoricData'] = input_dfs['HistoricData'][['Country name','Regional indicator','year','Ladder score','Logged GDP per capita',
                                                       'Social support','Healthy life expectancy','Freedom to make life choices',
                                                       'Generosity','Perceptions of corruption','Positive affect','Negative affect']]
In [9]:
input_dfs['Data_2021']['Positive affect'] = None
input_dfs['Data_2021']['Negative affect'] = None
input_dfs['Data_2021']['year'] = 2021
input_dfs['Data_2021'] = input_dfs['Data_2021'][['Country name','Regional indicator','year','Ladder score','Logged GDP per capita','Social support',
                                                    'Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption',
                                                    'Positive affect','Negative affect']]
In [10]:
# Conferindo os dados
input_dfs['Data_2021'].head()
Out[10]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Finland Western Europe 2021 7.8421 10.775202 0.953603 72.000000 0.949268 -0.097760 0.185846 None None
1 Denmark Western Europe 2021 7.6195 10.933176 0.954410 72.699753 0.945639 0.030109 0.178838 None None
2 Switzerland Western Europe 2021 7.5715 11.117368 0.941742 74.400101 0.918788 0.024629 0.291698 None None
3 Iceland Western Europe 2021 7.5539 10.877768 0.982938 73.000000 0.955123 0.160274 0.672865 None None
4 Netherlands Western Europe 2021 7.4640 10.931812 0.941601 72.400116 0.913116 0.175404 0.337938 None None
In [11]:
# Conferindo os dados
input_dfs['HistoricData'].head()
Out[11]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Afghanistan None 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 0.258195
1 Afghanistan None 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 0.237092
2 Afghanistan None 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 0.275324
3 Afghanistan None 2011 3.831719 7.619532 0.521104 51.919998 0.495901 0.162427 0.731109 0.611387 0.267175
4 Afghanistan None 2012 3.782938 7.705479 0.520637 52.240002 0.530935 0.236032 0.775620 0.710385 0.267919

Criação do Dataframe com "dados brutos"¶

In [12]:
input_dfs['Data_2021'].shape
Out[12]:
(149, 12)
In [13]:
input_dfs['HistoricData'].shape
Out[13]:
(1949, 12)
In [14]:
df = input_dfs['Data_2021'].append(input_dfs['HistoricData'])
df.sort_values(by=['Country name','year'], ascending=True, inplace=True)
df.index = range(len(df.index))
In [15]:
# shape final é a soma dos shapes separados, append ok
df.shape
Out[15]:
(2098, 12)
In [16]:
# Remoção de dados duplicados, se houver
df.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=False)
df.shape
Out[16]:
(2098, 12)
  • não há dados duplicados, a quantidade de amostras se manteve
In [17]:
df.head()
Out[17]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Afghanistan None 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 0.258195
1 Afghanistan None 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 0.237092
2 Afghanistan None 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 0.275324
3 Afghanistan None 2011 3.831719 7.619532 0.521104 51.919998 0.495901 0.162427 0.731109 0.611387 0.267175
4 Afghanistan None 2012 3.782938 7.705479 0.520637 52.240002 0.530935 0.236032 0.775620 0.710385 0.267919
In [18]:
# todos os dados de 2021 pra frente vêm de 'Data_2021'
df[df.year == 2021].shape == input_dfs['Data_2021'].shape
Out[18]:
True
In [19]:
# todos os dados anteriores a 2021 vêm de 'Data_2021'
df[df.year != 2021].shape == input_dfs['HistoricData'].shape
Out[19]:
True

Análise e manipulação do dataset¶

In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  2098 non-null   object 
 1   Regional indicator            149 non-null    object 
 2   year                          2098 non-null   int64  
 3   Ladder score                  2098 non-null   float64
 4   Logged GDP per capita         2062 non-null   float64
 5   Social support                2085 non-null   float64
 6   Healthy life expectancy       2043 non-null   float64
 7   Freedom to make life choices  2066 non-null   float64
 8   Generosity                    2009 non-null   float64
 9   Perceptions of corruption     1988 non-null   float64
 10  Positive affect               1927 non-null   object 
 11  Negative affect               1933 non-null   object 
dtypes: float64(7), int64(1), object(4)
memory usage: 213.1+ KB


  • Dados ausentes em Regional indicator, Logged GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, Positive affect, Negative affect

Correção dos tipos de dados¶

In [21]:
df['Positive affect'] = df['Positive affect'].astype(float)
df['Negative affect'] = df['Negative affect'].astype(float)

Análise de Dados faltantes¶

In [22]:
df.describe(include='all').T
Out[22]:
count unique top freq mean std min 25% 50% 75% max
Country name 2098 166 Zimbabwe 16 NaN NaN NaN NaN NaN NaN NaN
Regional indicator 149 10 Sub-Saharan Africa 36 NaN NaN NaN NaN NaN NaN NaN
year 2098.0 NaN NaN NaN 2013.768827 4.486449 2005.0 2010.0 2014.0 2018.0 2021.0
Ladder score 2098.0 NaN NaN NaN 5.471403 1.112682 2.375092 4.652504 5.391887 6.282982 8.018934
Logged GDP per capita 2062.0 NaN NaN NaN 9.373065 1.154252 6.635322 8.470213 9.462173 10.360714 11.648169
Social support 2085.0 NaN NaN NaN 0.812709 0.118202 0.290184 0.749633 0.834716 0.90529 0.987343
Healthy life expectancy 2043.0 NaN NaN NaN 63.478503 7.468781 32.299999 58.7045 65.279999 68.660004 77.099998
Freedom to make life choices 2066.0 NaN NaN NaN 0.746101 0.140774 0.257534 0.652307 0.766931 0.859147 0.985178
Generosity 2009.0 NaN NaN NaN -0.001023 0.161405 -0.33504 -0.115171 -0.026638 0.089205 0.698099
Perceptions of corruption 1988.0 NaN NaN NaN 0.745639 0.186267 0.035198 0.688764 0.800729 0.869042 0.983276
Positive affect 1927.0 NaN NaN NaN 0.709998 0.107106 0.32169 0.625373 0.722391 0.799276 0.943621
Negative affect 1933.0 NaN NaN NaN 0.268552 0.085176 0.082737 0.206403 0.258117 0.319716 0.70459
In [23]:
for c in df:
    print(f"[{c}]:\nTotal de Nulos:{df[c].isna().sum()}\tTotal percentual de Nulos: {round(100*df[c].isna().sum()/df.shape[0],3)}%\n")
[Country name]:
Total de Nulos:0	Total percentual de Nulos: 0.0%

[Regional indicator]:
Total de Nulos:1949	Total percentual de Nulos: 92.898%

[year]:
Total de Nulos:0	Total percentual de Nulos: 0.0%

[Ladder score]:
Total de Nulos:0	Total percentual de Nulos: 0.0%

[Logged GDP per capita]:
Total de Nulos:36	Total percentual de Nulos: 1.716%

[Social support]:
Total de Nulos:13	Total percentual de Nulos: 0.62%

[Healthy life expectancy]:
Total de Nulos:55	Total percentual de Nulos: 2.622%

[Freedom to make life choices]:
Total de Nulos:32	Total percentual de Nulos: 1.525%

[Generosity]:
Total de Nulos:89	Total percentual de Nulos: 4.242%

[Perceptions of corruption]:
Total de Nulos:110	Total percentual de Nulos: 5.243%

[Positive affect]:
Total de Nulos:171	Total percentual de Nulos: 8.151%

[Negative affect]:
Total de Nulos:165	Total percentual de Nulos: 7.865%

  • Aqui, qualquer quantidade de dados faltantes pode representar um problema, pois podem haver países com uma única amostra no dataset

Q3 - Influência da Pandemia, com dados faltantes¶

  • Os dados serão normalizados para melhor visualização
In [24]:
#na_ano = df.groupby(by=['Country name','year'],as_index=False).agg('mean')
pais_ano = df.groupby(by=['Country name','year']).agg('mean')
ao_ano_na = df.groupby(by='year',as_index=False).agg('mean')
na_scaler = MinMaxScaler()
sc_ano_na = pd.DataFrame(na_scaler.fit_transform(ao_ano_na.drop(columns='year')),columns=ao_ano_na.drop(columns='year').columns)
sc_ano_na['year'] = ao_ano_na['year']
sc_ano_na = sc_ano_na[[*ao_ano_na.columns]]
In [25]:
print(f"Sobre o scaler dos dados com na:\nFeatures: {na_scaler.feature_names_in_}\nValores Máximos: {na_scaler.data_max_}\nValores Mínimos: {na_scaler.data_min_}\nFaixa de valores: {na_scaler.feature_range}\nParâmetros gerais: {na_scaler.get_params()}")
Sobre o scaler dos dados com na:
Features: ['Ladder score' 'Logged GDP per capita' 'Social support'
 'Healthy life expectancy' 'Freedom to make life choices' 'Generosity'
 'Perceptions of corruption' 'Positive affect' 'Negative affect']
Valores Máximos: [ 6.44616427 10.11863786  0.89736686 67.0995636   0.82961793  0.25623003
  0.79206881  0.74856599  0.29270757]
Valores Mínimos: [ 5.19693529e+00  9.04429773e+00  7.84400843e-01  6.01475000e+01
  6.87328704e-01 -2.31054730e-02  7.07697230e-01  7.01571426e-01
  2.40694924e-01]
Faixa de valores: (0, 1)
Parâmetros gerais: {'clip': False, 'copy': True, 'feature_range': (0, 1)}
In [26]:
fig = go.Figure()
cols = [*sc_ano_na.columns]
cols.remove('year')
for column in cols:
    fig.add_trace(go.Scatter( x = sc_ano_na.year, y = sc_ano_na[column], name = column, mode = 'lines') )
fig.update_layout(title = "Indicadores Escalonados Globais por Ano [com dados faltantes]", xaxis_title = 'Ano')
fig.show()
fig.write_html('indicadores_global_norm_faltantes.html')
  • Todos os atributos foram impactados pela pandemia, embora Perceptions of corruption, Generosity tenham comportamento invertido.
  • É evidente o fato de a pandemia ter influenciado os indicadores
In [27]:
br_ano_na = df[df['Country name']=='Brazil'].groupby(by='year',as_index=False).agg('mean')
cols = [*br_ano_na.columns]
cols.remove('year')
fig = go.Figure()
for column in cols:
    fig.add_trace(go.Scatter( x = br_ano_na.year, y = br_ano_na[column], name = column, mode = 'lines') )
fig.update_layout(title = "Indicadores por Ano [Brasil, não-normalizados]", xaxis_title = 'Ano')
fig.show()
fig.write_html('indicadores_br_raw_faltantes.html')
  • Uma espiada no comportamento do meu país

Casos singulares de dados faltantes¶

  • São aqueles em que não há um histórico de comportamento para algum dos atributos, tornando irrecomendável substituir um valor faltante por uma média, e fazendo com que seja necessária a criação de algum modelo específico para aquele atributo.
  • Ou aqueles países que só possuem uma amostra.

Casos singulares¶

  • Para detectar os casos singulares foi adotada a medida de comparação onde a quantidade de dados faltantes em alguma das colunas deve ser uma unidade menor do que o total de amostras.
  • A China, por exemplo, contém 16 amostras mas 15 dados faltantes em "Perceptions of corruption"
In [28]:
casos_singulares = []
uma_amostra = []
for p in df['Country name'].unique():
    df_pais = df[df['Country name']==p]
    if (df_pais.drop(columns='Regional indicator').isna().sum() >= df_pais.shape[0]-1).any():
        print(f"País: {p}\nAmostras: {df_pais.shape[0]}\tTotal de faltantes: {df_pais.isna().sum().sum()}")
        display(pd.DataFrame(df_pais.drop(columns='Regional indicator').isna().sum()).T)
        casos_singulares.append(p)
    if df_pais.shape[0] == 1:
        uma_amostra.append(p)
País: China
Amostras: 16	Total de faltantes: 38
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 0 0 0 5 1 15 1 1
País: Cuba
Amostras: 1	Total de faltantes: 4
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 1 0 0 0 1 1 0 0
País: Guyana
Amostras: 1	Total de faltantes: 1
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 0 0 0 0 0 0 0 0
País: Hong Kong S.A.R. of China
Amostras: 12	Total de faltantes: 26
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 1 0 11 0 1 0 1 1
País: Kosovo
Amostras: 15	Total de faltantes: 34
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 1 0 14 1 1 0 2 1
País: Maldives
Amostras: 2	Total de faltantes: 6
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 0 0 0 0 0 1 2 2
País: North Cyprus
Amostras: 8	Total de faltantes: 30
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 7 0 7 0 7 0 1 1
País: Oman
Amostras: 1	Total de faltantes: 4
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 0 1 0 0 0 1 1 0
País: Qatar
Amostras: 5	Total de faltantes: 18
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 0 2 0 2 1 4 2 2
País: Somalia
Amostras: 3	Total de faltantes: 9
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 3 0 0 0 3 0 0 0
País: Somaliland region
Amostras: 4	Total de faltantes: 16
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 4 0 4 0 4 0 0 0
País: South Sudan
Amostras: 4	Total de faltantes: 12
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 4 0 0 0 4 0 0 0
País: Suriname
Amostras: 1	Total de faltantes: 1
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 0 0 0 0 0 0 0 0
País: Turkmenistan
Amostras: 11	Total de faltantes: 24
Country name year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 0 0 0 0 0 0 2 0 10 1 1
In [29]:
casos_singulares = list(set(casos_singulares)-set(uma_amostra))
print(f"Casos singulares de dados faltantes: {len(casos_singulares)}\n{[*casos_singulares]}")
Casos singulares de dados faltantes: 10
['North Cyprus', 'China', 'Qatar', 'Hong Kong S.A.R. of China', 'Maldives', 'Kosovo', 'South Sudan', 'Somalia', 'Somaliland region', 'Turkmenistan']

Uma Amostra¶

In [30]:
df[df['Country name'].isin(uma_amostra)]
Out[30]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
454 Cuba None 2006 5.417869 NaN 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602
723 Guyana None 2007 5.992826 8.773289 0.848765 57.259998 0.694006 0.110037 0.835569 0.767541 0.296420
1414 Oman None 2011 6.852982 10.382462 NaN 65.500000 0.916293 0.024908 NaN NaN 0.295164
1759 Suriname None 2012 6.269287 9.797085 0.797262 62.240002 0.885488 -0.077173 0.751283 0.764223 0.250365

Alguns Exemplos¶

In [31]:
df[df['Country name']=='Kosovo']
Out[31]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
973 Kosovo None 2007 5.103906 8.927753 0.847812 NaN 0.381364 0.143901 0.894462 0.654866 0.236699
974 Kosovo None 2008 5.521660 8.980872 0.883843 NaN NaN 0.090464 0.849059 NaN 0.317828
975 Kosovo None 2009 5.891433 9.008162 0.830427 NaN 0.506415 0.200504 0.967839 0.597583 0.168830
976 Kosovo None 2010 5.176601 9.032693 0.707959 NaN 0.451444 0.169696 0.967272 0.695178 0.117717
977 Kosovo None 2011 4.859502 9.066925 0.759102 NaN 0.588979 0.003699 0.919212 0.695966 0.124438
978 Kosovo None 2012 5.639588 9.085688 0.757147 NaN 0.635793 0.027182 0.949651 0.595572 0.099630
979 Kosovo None 2013 6.125758 9.113430 0.720750 NaN 0.568463 0.114904 0.935095 0.691511 0.202731
980 Kosovo None 2014 5.000375 9.128522 0.705632 NaN 0.441391 0.012095 0.775201 0.636128 0.205950
981 Kosovo None 2015 5.077461 9.182307 0.805271 NaN 0.561048 0.180851 0.850647 0.753090 0.179989
982 Kosovo None 2016 5.759412 9.228177 0.823803 NaN 0.827399 0.124869 0.940898 0.703887 0.149607
983 Kosovo None 2017 6.149200 9.262030 0.792087 NaN 0.857677 0.117175 0.925192 0.738436 0.185879
984 Kosovo None 2018 6.391826 9.296085 0.822407 NaN 0.889737 0.268795 0.922078 0.778271 0.170248
985 Kosovo None 2019 6.425144 9.338535 0.842511 NaN 0.841190 0.246990 0.920297 0.748522 0.140792
986 Kosovo None 2020 6.294414 NaN 0.792374 NaN 0.879838 NaN 0.909894 0.726240 0.201458
987 Kosovo Central and Eastern Europe 2021 6.372000 9.318236 0.820958 63.812744 0.868972 0.257417 0.917488 NaN NaN
In [32]:
df[df['Country name']=='China']
Out[32]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
367 China None 2006 4.560495 8.696120 0.747011 66.879997 NaN NaN NaN 0.809295 0.169580
368 China None 2007 4.862862 8.823954 0.810852 67.059998 NaN -0.176243 NaN 0.817485 0.158614
369 China None 2008 4.846295 8.910992 0.748287 67.239998 0.853072 -0.092472 NaN 0.817443 0.146963
370 China None 2009 4.454361 8.995857 0.798034 67.419998 0.771143 -0.160481 NaN 0.785806 0.161650
371 China None 2010 4.652737 9.092104 0.767753 67.599998 0.804794 -0.133318 NaN 0.765265 0.158100
372 China None 2011 5.037208 9.178532 0.787171 67.760002 0.824162 -0.186383 NaN 0.820074 0.133503
373 China None 2012 5.094917 9.249320 0.787818 67.919998 0.808255 -0.184676 NaN 0.820785 0.158703
374 China None 2013 5.241090 9.319200 0.777896 68.080002 0.804724 -0.157777 NaN 0.836431 0.142211
375 China None 2014 5.195619 9.385755 0.820366 68.239998 NaN -0.216772 NaN 0.853975 0.111518
376 China None 2015 5.303878 9.448723 0.793734 68.400002 NaN -0.244435 NaN 0.808911 0.171315
377 China None 2016 5.324956 9.509552 0.741703 68.699997 NaN -0.227522 NaN 0.826144 0.145625
378 China None 2017 5.099061 9.571116 0.772033 69.000000 0.877618 -0.174832 NaN 0.821097 0.214005
379 China None 2018 5.131434 9.631892 0.787605 69.300003 0.895378 -0.158510 NaN 0.855784 0.189640
380 China None 2019 5.144120 9.687612 0.821936 69.599998 0.927356 -0.173036 NaN 0.890780 0.146512
381 China None 2020 5.771065 9.701755 0.808334 69.900002 0.891123 -0.103214 NaN 0.789345 0.244918
382 China East Asia 2021 5.339100 9.673172 0.810829 69.593407 0.904293 -0.145908 0.755389 NaN NaN



Regiões (Regional Indiator)¶

In [33]:
len(df['Regional indicator'].unique())
Out[33]:
11
  • Ao todo são 10 indicadores de regiões presentes no dataframe, mais um "None", introduzido no tratamento.
In [34]:
df['Regional indicator'].unique()
Out[34]:
array([None, 'South Asia', 'Central and Eastern Europe',
       'Middle East and North Africa', 'Latin America and Caribbean',
       'Commonwealth of Independent States', 'North America and ANZ',
       'Western Europe', 'Sub-Saharan Africa', 'Southeast Asia',
       'East Asia'], dtype=object)


Distribuição de países por regiões¶

In [35]:
# países por região
for ri in df['Regional indicator'].unique():
    print(f"{ri}: { len(df[df['Regional indicator'] == ri]['Country name'].unique())}, {df[df['Regional indicator'] == ri]['Country name'].unique()}\n")
None: 0, []

South Asia: 7, ['Afghanistan' 'Bangladesh' 'India' 'Maldives' 'Nepal' 'Pakistan'
 'Sri Lanka']

Central and Eastern Europe: 17, ['Albania' 'Bosnia and Herzegovina' 'Bulgaria' 'Croatia' 'Czech Republic'
 'Estonia' 'Hungary' 'Kosovo' 'Latvia' 'Lithuania' 'Montenegro'
 'North Macedonia' 'Poland' 'Romania' 'Serbia' 'Slovakia' 'Slovenia']

Middle East and North Africa: 17, ['Algeria' 'Bahrain' 'Egypt' 'Iran' 'Iraq' 'Israel' 'Jordan' 'Kuwait'
 'Lebanon' 'Libya' 'Morocco' 'Palestinian Territories' 'Saudi Arabia'
 'Tunisia' 'Turkey' 'United Arab Emirates' 'Yemen']

Latin America and Caribbean: 20, ['Argentina' 'Bolivia' 'Brazil' 'Chile' 'Colombia' 'Costa Rica'
 'Dominican Republic' 'Ecuador' 'El Salvador' 'Guatemala' 'Haiti'
 'Honduras' 'Jamaica' 'Mexico' 'Nicaragua' 'Panama' 'Paraguay' 'Peru'
 'Uruguay' 'Venezuela']

Commonwealth of Independent States: 12, ['Armenia' 'Azerbaijan' 'Belarus' 'Georgia' 'Kazakhstan' 'Kyrgyzstan'
 'Moldova' 'Russia' 'Tajikistan' 'Turkmenistan' 'Ukraine' 'Uzbekistan']

North America and ANZ: 4, ['Australia' 'Canada' 'New Zealand' 'United States']

Western Europe: 21, ['Austria' 'Belgium' 'Cyprus' 'Denmark' 'Finland' 'France' 'Germany'
 'Greece' 'Iceland' 'Ireland' 'Italy' 'Luxembourg' 'Malta' 'Netherlands'
 'North Cyprus' 'Norway' 'Portugal' 'Spain' 'Sweden' 'Switzerland'
 'United Kingdom']

Sub-Saharan Africa: 36, ['Benin' 'Botswana' 'Burkina Faso' 'Burundi' 'Cameroon' 'Chad' 'Comoros'
 'Congo (Brazzaville)' 'Ethiopia' 'Gabon' 'Gambia' 'Ghana' 'Guinea'
 'Ivory Coast' 'Kenya' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mauritius' 'Mozambique' 'Namibia' 'Niger' 'Nigeria'
 'Rwanda' 'Senegal' 'Sierra Leone' 'South Africa' 'Swaziland' 'Tanzania'
 'Togo' 'Uganda' 'Zambia' 'Zimbabwe']

Southeast Asia: 9, ['Cambodia' 'Indonesia' 'Laos' 'Malaysia' 'Myanmar' 'Philippines'
 'Singapore' 'Thailand' 'Vietnam']

East Asia: 6, ['China' 'Hong Kong S.A.R. of China' 'Japan' 'Mongolia' 'South Korea'
 'Taiwan Province of China']



Países sem região determinada¶

  • Alguns países não possuem a região determinada no dataset
  • Para outros, pode-se utilizar a região determinada no dataset de 2021 para preencher os campos dos dados históricos
In [36]:
df[ (df['Country name'] == 'Afghanistan') ]
Out[36]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Afghanistan None 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 0.258195
1 Afghanistan None 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 0.237092
2 Afghanistan None 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 0.275324
3 Afghanistan None 2011 3.831719 7.619532 0.521104 51.919998 0.495901 0.162427 0.731109 0.611387 0.267175
4 Afghanistan None 2012 3.782938 7.705479 0.520637 52.240002 0.530935 0.236032 0.775620 0.710385 0.267919
5 Afghanistan None 2013 3.572100 7.725029 0.483552 52.560001 0.577955 0.061148 0.823204 0.620585 0.273328
6 Afghanistan None 2014 3.130896 7.718354 0.525568 52.880001 0.508514 0.104013 0.871242 0.531691 0.374861
7 Afghanistan None 2015 3.982855 7.701992 0.528597 53.200001 0.388928 0.079864 0.880638 0.553553 0.339276
8 Afghanistan None 2016 4.220169 7.696560 0.559072 53.000000 0.522566 0.042265 0.793246 0.564953 0.348332
9 Afghanistan None 2017 2.661718 7.697381 0.490880 52.799999 0.427011 -0.121303 0.954393 0.496349 0.371326
10 Afghanistan None 2018 2.694303 7.691767 0.507516 52.599998 0.373536 -0.093828 0.927606 0.424125 0.404904
11 Afghanistan None 2019 2.375092 7.697248 0.419973 52.400002 0.393656 -0.108459 0.923849 0.351387 0.502474
12 Afghanistan South Asia 2021 2.522900 7.694710 0.462596 52.492615 0.381749 -0.101684 0.924338 NaN NaN
In [37]:
df[ (df['Country name'] == 'Austria') ]
Out[37]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
86 Austria None 2006 7.122211 10.841940 0.936350 70.760002 0.941382 0.302386 0.490111 0.823105 0.173812
87 Austria None 2008 7.180954 10.886662 0.934593 71.080002 0.879069 0.291309 0.613625 0.832170 0.173195
88 Austria None 2010 7.302679 10.861471 0.914193 71.400002 0.895980 0.130891 0.546145 0.814719 0.155793
89 Austria None 2011 7.470513 10.886909 0.944157 71.540001 0.939356 0.131578 0.702721 0.789471 0.145238
90 Austria None 2012 7.400689 10.889132 0.945142 71.680000 0.919704 0.117804 0.770586 0.822248 0.156675
91 Austria None 2013 7.498803 10.883492 0.949809 71.820000 0.921734 0.168248 0.678937 0.787313 0.162603
92 Austria None 2014 6.950000 10.882268 0.898920 71.959999 0.885027 0.117607 0.566931 0.779693 0.170150
93 Austria None 2015 7.076447 10.881152 0.928110 72.099998 0.900305 0.098893 0.557480 0.798263 0.164469
94 Austria None 2016 7.048072 10.890950 0.926319 72.400002 0.888514 0.079749 0.523641 0.755903 0.197424
95 Austria None 2017 7.293728 10.908466 0.906218 72.699997 0.890031 0.133064 0.518304 0.747569 0.180269
96 Austria None 2018 7.396002 10.927505 0.911668 73.000000 0.904112 0.053470 0.523061 0.752350 0.226059
97 Austria None 2019 7.195361 10.939381 0.964489 73.300003 0.903428 0.059686 0.457089 0.774459 0.205170
98 Austria None 2020 7.213489 10.851118 0.924831 73.599998 0.911910 0.011032 0.463830 0.769317 0.206500
99 Austria Western Europe 2021 7.267800 10.906316 0.934176 73.299721 0.907691 0.041568 0.481378 NaN NaN

Preenchimento de regiões com valores dados em 2021¶

In [38]:
# preenchendo campos com dados já fornecidos
for p in df['Country name'].unique():
    regs = [ *df[ df['Country name'] == p ]['Regional indicator'].unique() ]
    regs.remove(None)
    if len(regs) == 1:
        df.loc[df[ df['Country name'] == p ].index,['Regional indicator']] = regs[0]
  • Conferência das mudanças
In [39]:
df[ (df['Country name'] == 'Afghanistan') ]
Out[39]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Afghanistan South Asia 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 0.258195
1 Afghanistan South Asia 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 0.237092
2 Afghanistan South Asia 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 0.275324
3 Afghanistan South Asia 2011 3.831719 7.619532 0.521104 51.919998 0.495901 0.162427 0.731109 0.611387 0.267175
4 Afghanistan South Asia 2012 3.782938 7.705479 0.520637 52.240002 0.530935 0.236032 0.775620 0.710385 0.267919
5 Afghanistan South Asia 2013 3.572100 7.725029 0.483552 52.560001 0.577955 0.061148 0.823204 0.620585 0.273328
6 Afghanistan South Asia 2014 3.130896 7.718354 0.525568 52.880001 0.508514 0.104013 0.871242 0.531691 0.374861
7 Afghanistan South Asia 2015 3.982855 7.701992 0.528597 53.200001 0.388928 0.079864 0.880638 0.553553 0.339276
8 Afghanistan South Asia 2016 4.220169 7.696560 0.559072 53.000000 0.522566 0.042265 0.793246 0.564953 0.348332
9 Afghanistan South Asia 2017 2.661718 7.697381 0.490880 52.799999 0.427011 -0.121303 0.954393 0.496349 0.371326
10 Afghanistan South Asia 2018 2.694303 7.691767 0.507516 52.599998 0.373536 -0.093828 0.927606 0.424125 0.404904
11 Afghanistan South Asia 2019 2.375092 7.697248 0.419973 52.400002 0.393656 -0.108459 0.923849 0.351387 0.502474
12 Afghanistan South Asia 2021 2.522900 7.694710 0.462596 52.492615 0.381749 -0.101684 0.924338 NaN NaN
In [40]:
df[ df['Country name'] == 'Austria' ]
Out[40]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
86 Austria Western Europe 2006 7.122211 10.841940 0.936350 70.760002 0.941382 0.302386 0.490111 0.823105 0.173812
87 Austria Western Europe 2008 7.180954 10.886662 0.934593 71.080002 0.879069 0.291309 0.613625 0.832170 0.173195
88 Austria Western Europe 2010 7.302679 10.861471 0.914193 71.400002 0.895980 0.130891 0.546145 0.814719 0.155793
89 Austria Western Europe 2011 7.470513 10.886909 0.944157 71.540001 0.939356 0.131578 0.702721 0.789471 0.145238
90 Austria Western Europe 2012 7.400689 10.889132 0.945142 71.680000 0.919704 0.117804 0.770586 0.822248 0.156675
91 Austria Western Europe 2013 7.498803 10.883492 0.949809 71.820000 0.921734 0.168248 0.678937 0.787313 0.162603
92 Austria Western Europe 2014 6.950000 10.882268 0.898920 71.959999 0.885027 0.117607 0.566931 0.779693 0.170150
93 Austria Western Europe 2015 7.076447 10.881152 0.928110 72.099998 0.900305 0.098893 0.557480 0.798263 0.164469
94 Austria Western Europe 2016 7.048072 10.890950 0.926319 72.400002 0.888514 0.079749 0.523641 0.755903 0.197424
95 Austria Western Europe 2017 7.293728 10.908466 0.906218 72.699997 0.890031 0.133064 0.518304 0.747569 0.180269
96 Austria Western Europe 2018 7.396002 10.927505 0.911668 73.000000 0.904112 0.053470 0.523061 0.752350 0.226059
97 Austria Western Europe 2019 7.195361 10.939381 0.964489 73.300003 0.903428 0.059686 0.457089 0.774459 0.205170
98 Austria Western Europe 2020 7.213489 10.851118 0.924831 73.599998 0.911910 0.011032 0.463830 0.769317 0.206500
99 Austria Western Europe 2021 7.267800 10.906316 0.934176 73.299721 0.907691 0.041568 0.481378 NaN NaN



In [41]:
df['Regional indicator'].isna().sum()
Out[41]:
63
  • Ainda assim há 63 registros sem região

Países sem região definida¶

In [42]:
sem_regiao = [*df[df['Regional indicator'].isna()]['Country name'].unique()]
sem_regiao
Out[42]:
['Angola',
 'Belize',
 'Bhutan',
 'Central African Republic',
 'Congo (Kinshasa)',
 'Cuba',
 'Djibouti',
 'Guyana',
 'Oman',
 'Qatar',
 'Somalia',
 'Somaliland region',
 'South Sudan',
 'Sudan',
 'Suriname',
 'Syria',
 'Trinidad and Tobago']

Cruzamento com dados da ONU¶

Tartamento dos dados da ONU¶

In [43]:
# colunas do dataset da Onu
input_dfs['regions_un'].columns
Out[43]:
Index(['Country or area ', 'Major area ', 'Region ', 'Development region'], dtype='object')
In [44]:
# Países no dataset da Onu
input_dfs['regions_un']['Country or area '].unique()[:5]
Out[44]:
array(['Afghanistan ', 'Albania ', 'Algeria ', 'American Samoa ',
       'Andorra '], dtype=object)
In [45]:
# removendo espaços vazios das células
for c in input_dfs['regions_un']:
    input_dfs['regions_un'][c] = [ x.strip() for x in input_dfs['regions_un'][c] ]
# removendo espaços vazios dos nomes das colunas
input_dfs['regions_un'].columns = [ x.strip() for x in input_dfs['regions_un'].columns ]
input_dfs['regions_un']['Country or area'].unique()[:5]
Out[45]:
array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
      dtype=object)

Criação de colunas auxiliares no Dataframe¶

In [46]:
df['Regional_indicator_consultado_Major'] = None
df['Regional_indicator_consultado'] = None
df = df[['Country name','Regional indicator','Regional_indicator_consultado_Major','Regional_indicator_consultado','year',
         'Ladder score','Logged GDP per capita','Social support','Healthy life expectancy','Freedom to make life choices',
         'Generosity','Perceptions of corruption','Positive affect','Negative affect']]
  • Em alguns casos o dataset fornecido usa o campo 'Major area' da ONU e em outros, o 'Region'
In [47]:
# da ONU
input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == 'Brazil']
Out[47]:
Country or area Major area Region Development region
27 Brazil Latin America and the Caribbean South America Less developed regions
In [48]:
# dos Dados fornecidos
df[df['Country name'] == 'Brazil'].head(1)
Out[48]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
234 Brazil Latin America and Caribbean None None 2005 6.636771 9.438417 0.882923 63.299999 0.882186 NaN 0.744994 0.818337 0.30178
In [49]:
# prenchimento das colunas auxiliares com dados da ONU
nao_encontrados = []
for p in df['Country name'].unique():
    if p in input_dfs['regions_un']['Country or area'].unique():
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado_Major']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == p]['Major area'].values[0]
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == p]['Region'].values[0]
    else:
        nao_encontrados.append(p)
  • Caso algum país não seja encontrado, este será adicionado à lista de "não encontrados"

Novo Formato¶

In [50]:
# novo formato
df[df['Country name'] == 'Brazil'].head(1)
Out[50]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
234 Brazil Latin America and Caribbean Latin America and the Caribbean South America 2005 6.636771 9.438417 0.882923 63.299999 0.882186 NaN 0.744994 0.818337 0.30178
In [51]:
df[ df['Country name'] == 'Oman' ]
Out[51]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
1414 Oman None Asia Western Asia 2011 6.852982 10.382462 NaN 65.5 0.916293 0.024908 NaN NaN 0.295164

Conferência dos dados das regiões¶

  • Regional indicator (original)
In [52]:
df['Regional indicator'].value_counts()
Out[52]:
Sub-Saharan Africa                    426
Latin America and Caribbean           299
Western Europe                        292
Central and Eastern Europe            242
Middle East and North Africa          228
Commonwealth of Independent States    182
Southeast Asia                        125
South Asia                             91
East Asia                              88
North America and ANZ                  62
Name: Regional indicator, dtype: int64
  • Regional indicator (ONU)
In [53]:
df['Regional_indicator_consultado'].value_counts()
Out[53]:
Western Asia                 209
Southern Europe              173
Western Africa               162
Eastern Africa               148
Northern Europe              127
South America                127
Eastern Europe               114
Central America              109
South-Eastern Asia           100
Western Europe                99
Southern Asia                 94
Central Asia                  73
Northern Africa               58
Middle Africa                 50
Eastern Asia                  46
Southern Africa               45
Caribbean                     41
Australia and New Zealand     30
Northern America              16
Name: Regional_indicator_consultado, dtype: int64
  • Major (ONU)
In [54]:
df['Regional_indicator_consultado_Major'].value_counts()
Out[54]:
Asia                               522
Europe                             513
Africa                             463
Latin America and the Caribbean    277
Oceania                             30
Northern America                    16
Name: Regional_indicator_consultado_Major, dtype: int64

Distribuição das regiões originais e consultadas por macorregiões da ONU¶

In [55]:
for mr in df['Regional_indicator_consultado_Major'].unique():
    print(df[df['Regional_indicator_consultado_Major'] == mr][['Regional_indicator_consultado_Major','Regional indicator','Regional_indicator_consultado']].value_counts())
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Asia                                 Middle East and North Africa        Western Asia                     143
                                     Southeast Asia                      South-Eastern Asia               100
                                     South Asia                          Southern Asia                     91
                                     Commonwealth of Independent States  Central Asia                      73
                                                                         Western Asia                      46
                                     East Asia                           Eastern Asia                      46
                                     Western Europe                      Western Asia                      14
dtype: int64
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Europe                               Central and Eastern Europe          Southern Europe                  99
                                     Western Europe                      Western Europe                   99
                                     Central and Eastern Europe          Eastern Europe                   83
                                     Western Europe                      Northern Europe                  81
                                                                         Southern Europe                  74
                                     Central and Eastern Europe          Northern Europe                  46
                                     Commonwealth of Independent States  Eastern Europe                   31
dtype: int64
Regional_indicator_consultado_Major  Regional indicator            Regional_indicator_consultado
Africa                               Sub-Saharan Africa            Western Africa                   162
                                                                   Eastern Africa                   141
                                     Middle East and North Africa  Northern Africa                   49
                                     Sub-Saharan Africa            Southern Africa                   45
                                                                   Middle Africa                     41
dtype: int64
Regional_indicator_consultado_Major  Regional indicator           Regional_indicator_consultado
Latin America and the Caribbean      Latin America and Caribbean  South America                    125
                                                                  Central America                  107
                                                                  Caribbean                         35
dtype: int64
Regional_indicator_consultado_Major  Regional indicator     Regional_indicator_consultado
Oceania                              North America and ANZ  Australia and New Zealand        30
dtype: int64
Series([], dtype: int64)
Regional_indicator_consultado_Major  Regional indicator     Regional_indicator_consultado
Northern America                     North America and ANZ  Northern America                 16
dtype: int64



Última conferência de regiões faltantes¶

In [56]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Country name                         2098 non-null   object 
 1   Regional indicator                   2035 non-null   object 
 2   Regional_indicator_consultado_Major  1821 non-null   object 
 3   Regional_indicator_consultado        1821 non-null   object 
 4   year                                 2098 non-null   int64  
 5   Ladder score                         2098 non-null   float64
 6   Logged GDP per capita                2062 non-null   float64
 7   Social support                       2085 non-null   float64
 8   Healthy life expectancy              2043 non-null   float64
 9   Freedom to make life choices         2066 non-null   float64
 10  Generosity                           2009 non-null   float64
 11  Perceptions of corruption            1988 non-null   float64
 12  Positive affect                      1927 non-null   float64
 13  Negative affect                      1933 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 310.4+ KB
  • Por algum motivo ainda há muitos dados faltantes sobre as regiões
  • Nomes dos países na base da ONU
In [57]:
input_dfs['regions_un']['Country or area'].unique()
Out[57]:
array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan',
       'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
       'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'British Virgin Islands',
       'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi',
       'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Channel Islands', 'Chile',
       'China', 'Colombia', 'Comoros', 'Congo', 'Cook Islands',
       'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Cyprus',
       'Czech Republic', "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Ethiopia', 'Faeroe Islands', 'Falkland Islands (Malvinas)',
       'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia',
       'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar',
       'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam',
       'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti',
       'Holy See', 'Honduras',
       'China, Hong Kong Special Administrative Region', 'Hungary',
       'Iceland', 'India', 'Indonesia', 'Iran (Islamic Republic of)',
       'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica',
       'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait',
       'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia',
       'Lebanon', 'Lesotho', 'Liberia', 'Libyan Arab Jamahiriya',
       'Liechtenstein', 'Lithuania', 'Luxembourg',
       'China, Macao Special Administrative Region', 'Madagascar',
       'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
       'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius',
       'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Monaco',
       'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique',
       'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands',
       'Netherlands Antilles', 'New Caledonia', 'New Zealand',
       'Nicaragua', 'Niger', 'Nigeria', 'Niue',
       'Northern Mariana Islands', 'Norway',
       'Occupied Palestinian Territory', 'Oman', 'Pakistan', 'Palau',
       'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
       'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar',
       'Republic of Korea', 'Republic of Moldova', 'Réunion', 'Romania',
       'Russian Federation', 'Rwanda', 'Saint Helena',
       'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines',
       'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia',
       'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore',
       'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia',
       'South Africa', 'Spain', 'Sri Lanka', 'South Sudan', 'Sudan',
       'Suriname', 'Swaziland', 'Sweden', 'Switzerland',
       'Syrian Arab Republic', 'Tajikistan', 'Thailand',
       'The former Yugoslav Republic of Macedonia', 'Timor-Leste', 'Togo',
       'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
       'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda',
       'Ukraine', 'United Arab Emirates',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America',
       'United States Virgin Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu',
       'Venezuela (Bolivarian Republic of)', 'Viet Nam',
       'Wallis and Futuna Islands', 'Western Sahara', 'Yemen', 'Zambia',
       'Zimbabwe', 'Czechoslovakia (former)',
       'German Democratic Republic', 'Åland Islands', 'Norfolk Island',
       'Saint-Barthélemy', 'Saint-Martin (French part)', 'Jersey',
       'Svalbard and Jan Mayen Islands', 'USSR (former)',
       'Yugoslavia (former)', 'Serbia and Montenegro (former)',
       'Egypt and Sudan', 'Nordic countries', 'Other Africa',
       'Bangladesh, India and Sri Lanka',
       'Pacific Islands Trust Territories', 'Kosovo',
       'USSR (former) - unknown', 'USSR (former) - European countries',
       'USSR (former) - Asian countries', 'Democratic Yemen (former)',
       'Other Latin America and the Caribbean', 'Other Northern America',
       'Other Polynesia', 'Other Europe', 'European Union',
       'Other Oceania', 'Other Northern Africa', 'Other Caribbean',
       'Caribbean Commonwealth (West Indies)', 'Other Central America',
       'Other South-Eastern Asia', 'Other South America', 'Other Asia',
       'Taiwan, Province of China', 'Other Commonwealth',
       'Other Micronesia', 'Other and unknown', 'Other Middle East',
       'Other', 'Unknown', 'Stateless', 'African Commonwealth',
       'Other Non-Commonwealth', 'Baltic states', 'Guernsey',
       'European Union-15', 'European Union-8', 'Other European Union',
       'Old Commonwealth', 'New Commonwealth', 'European Union-12',
       'Australia and New Zealand', 'Asia Commonwealth',
       'America Commonwealth', 'Oceania Commonwealth',
       'Europe Commonwealth', 'Africa Commonwealth'], dtype=object)
  • Nomes dos países não encontrados na base da ONU
In [58]:
# países não encontrados
nao_encontrados
Out[58]:
['Bolivia',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Hong Kong S.A.R. of China',
 'Iran',
 'Ivory Coast',
 'Laos',
 'Libya',
 'Moldova',
 'North Cyprus',
 'North Macedonia',
 'Palestinian Territories',
 'Russia',
 'Somaliland region',
 'South Korea',
 'Syria',
 'Taiwan Province of China',
 'Tanzania',
 'United Kingdom',
 'United States',
 'Venezuela',
 'Vietnam']
  • Países não encontrados na base da ONU e que originalmente não possuem Região definida
In [59]:
df[ (df['Country name'].isin(nao_encontrados)) & (df['Regional indicator'].isna()) ]['Country name'].unique()
#['Regional indicator'].unique()
Out[59]:
array(['Congo (Kinshasa)', 'Somaliland region', 'Syria'], dtype=object)
In [ ]:
 

Análise de nomes dos países¶

  • Alguns países possuem nomes diferentes nas bases
  • Dicionário de cruzamento de nomes de países
In [60]:
# Cruzamento de nomes de países entre as bases
obs = {'Bolivia':'Bolivia (Plurinational State of)', 'Congo (Brazzaville)':'Congo', 'Hong Kong S.A.R. of China':'China, Hong Kong Special Administrative Region',
       'Iran':'Iran (Islamic Republic of)', 'Ivory Coast':"Côte d'Ivoire", 'Laos':"Lao People's Democratic Republic", 'Libya':'Libyan Arab Jamahiriya',
       'Moldova':'Republic of Moldova', 'North Cyprus':'Cyprus', 'North Macedonia':'The former Yugoslav Republic of Macedonia', 'Palestinian Territories':'Occupied Palestinian Territory',
       'Russia':'Russian Federation', 'South Korea':'Republic of Korea', 'Taiwan Province of China':"Taiwan, Province of China", 'Tanzania':'United Republic of Tanzania',
       'United Kingdom':'United Kingdom of Great Britain and Northern Ireland', 'United States':'United States of America', 'Venezuela':'Venezuela (Bolivarian Republic of)',
       'Vietnam':'Viet Nam','Congo (Kinshasa)':'Democratic Republic of the Congo', 'Somaliland region':'Somalia', 'Syria':'Syrian Arab Republic'}
# Somaliland é um país independente e diferente da somália, mas as macroregiões são as mesmas
In [61]:
# preenchimento dos dados faltantes após o cruzamento
for p in obs.keys():
    if obs[p] in input_dfs['regions_un']['Country or area'].unique():
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado_Major']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == obs[p]]['Major area'].values[0]
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == obs[p]]['Region'].values[0]
In [62]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Country name                         2098 non-null   object 
 1   Regional indicator                   2035 non-null   object 
 2   Regional_indicator_consultado_Major  2098 non-null   object 
 3   Regional_indicator_consultado        2098 non-null   object 
 4   year                                 2098 non-null   int64  
 5   Ladder score                         2098 non-null   float64
 6   Logged GDP per capita                2062 non-null   float64
 7   Social support                       2085 non-null   float64
 8   Healthy life expectancy              2043 non-null   float64
 9   Freedom to make life choices         2066 non-null   float64
 10  Generosity                           2009 non-null   float64
 11  Perceptions of corruption            1988 non-null   float64
 12  Positive affect                      1927 non-null   float64
 13  Negative affect                      1933 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 310.4+ KB
  • Todas as macrorregiões e regiões dos países foram encontradas!
  • O campo original "Regional indicator" foi preservado para fins de análise, e é de grande interesse que apresente dados faltantes
In [63]:
df.to_csv('df_com_zonas.csv')




Preenchimento de dados faltantes¶

  • Substituindo os campos vazios pelos valores médios de cada atributo, para cada país, exceto para os casos onde não faria sentido, isto é, para os países que só possuem um registro ou cujos registros em um atributo possuam apenas um valor válido e os demais nulos.
In [64]:
country_names = [*df['Country name'].unique()]
cn_no_commons = list(set(country_names) - set(casos_singulares) - set(uma_amostra))

for p in df['Country name'].unique():
#for p in cn_no_commons:
    #print(p)
    for col in df[df['Country name'] == p].drop(columns=['Country name','Regional indicator','Regional_indicator_consultado_Major','Regional_indicator_consultado']):
        #print('\t',col)
        nadf = df[ (df['Country name'] == p) & (df[col].isna()) ]
        #if nadf.shape[0] > 0:
        #    print(nadf.shape)
        if nadf.shape[0] > 0 and nadf.shape[0] <= df[ df['Country name'] == p ].shape[0]-2:
            df.loc[ nadf.index,col ] = df[df['Country name'] == p][col].mean()
In [65]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Country name                         2098 non-null   object 
 1   Regional indicator                   2035 non-null   object 
 2   Regional_indicator_consultado_Major  2098 non-null   object 
 3   Regional_indicator_consultado        2098 non-null   object 
 4   year                                 2098 non-null   int64  
 5   Ladder score                         2098 non-null   float64
 6   Logged GDP per capita                2079 non-null   float64
 7   Social support                       2097 non-null   float64
 8   Healthy life expectancy              2062 non-null   float64
 9   Freedom to make life choices         2098 non-null   float64
 10  Generosity                           2079 non-null   float64
 11  Perceptions of corruption            2066 non-null   float64
 12  Positive affect                      2095 non-null   float64
 13  Negative affect                      2096 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 310.4+ KB


Países de comportamento singular e exmplos de campos para os quais seriam necessário modelos específicos¶

In [66]:
print(f"Singulares: {casos_singulares}\nUma linha: {uma_amostra}")
Singulares: ['North Cyprus', 'China', 'Qatar', 'Hong Kong S.A.R. of China', 'Maldives', 'Kosovo', 'South Sudan', 'Somalia', 'Somaliland region', 'Turkmenistan']
Uma linha: ['Cuba', 'Guyana', 'Oman', 'Suriname']
In [67]:
df[df['Country name'] == 'Hong Kong S.A.R. of China']
Out[67]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
751 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2006 5.511187 10.746425 0.812178 NaN 0.909820 0.155567 0.355985 0.723260 0.235955
752 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2008 5.137262 10.815545 0.840222 NaN 0.922211 0.296268 0.273945 0.718972 0.236634
753 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2009 5.397056 10.788494 0.834716 NaN 0.918026 0.307638 0.272125 0.762151 0.210104
754 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2010 5.642835 10.846634 0.857314 NaN 0.890418 0.331955 0.255775 0.710370 0.183106
755 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2011 5.474011 10.886932 0.846060 NaN 0.894330 0.234555 0.244887 0.733887 0.195712
756 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2012 5.483765 10.892753 0.826426 NaN 0.879752 0.222402 0.379783 0.715137 0.183349
757 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2014 5.458051 10.939503 0.833558 NaN 0.843082 0.223799 0.422960 0.683968 0.242868
758 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2016 5.498421 10.969857 0.832078 NaN 0.799743 0.100235 0.402813 0.664093 0.213115
759 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2017 5.362475 10.999584 0.831066 NaN 0.830657 0.140063 0.415810 0.639533 0.200593
760 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2019 5.659317 11.000313 0.855826 NaN 0.726852 0.067344 0.431974 0.599320 0.357607
761 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2020 5.295341 10.898759 0.812943 NaN 0.705452 0.195197 0.380351 0.608647 0.210314
762 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2021 5.476700 11.000313 0.835781 76.820091 0.716808 0.067344 0.402650 0.687213 0.224487
In [68]:
df[df['Logged GDP per capita'].isna()]
Out[68]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
454 Cuba None Latin America and the Caribbean Caribbean 2006 5.417869 NaN 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602
1381 North Cyprus Western Europe Asia Western Asia 2012 5.463305 NaN 0.871150 NaN 0.692568 NaN 0.854730 0.709236 0.405435
1382 North Cyprus Western Europe Asia Western Asia 2013 5.566803 NaN 0.869274 NaN 0.775383 NaN 0.715356 0.621554 0.442972
1383 North Cyprus Western Europe Asia Western Asia 2014 5.785979 NaN 0.801802 NaN 0.829677 NaN 0.692221 0.723842 0.311336
1384 North Cyprus Western Europe Asia Western Asia 2015 5.842550 NaN 0.791383 NaN 0.785353 NaN 0.659180 0.701609 0.318930
1385 North Cyprus Western Europe Asia Western Asia 2016 5.827128 NaN 0.807690 NaN 0.796234 NaN 0.670191 0.643664 0.346465
1386 North Cyprus Western Europe Asia Western Asia 2018 5.608056 NaN 0.837392 NaN 0.797066 NaN 0.613837 0.480453 0.261868
1387 North Cyprus Western Europe Asia Western Asia 2019 5.466615 NaN 0.803295 NaN 0.792735 NaN 0.640059 0.493693 0.296411
1681 Somalia None Africa Eastern Africa 2014 5.528273 NaN 0.610836 49.599998 0.873879 NaN 0.456470 0.834454 0.207215
1682 Somalia None Africa Eastern Africa 2015 5.353645 NaN 0.599281 50.099998 0.967869 NaN 0.410236 0.900668 0.186736
1683 Somalia None Africa Eastern Africa 2016 4.667941 NaN 0.594417 50.000000 0.917323 NaN 0.440802 0.891423 0.193282
1684 Somaliland region None Africa Eastern Africa 2009 4.991400 NaN 0.879567 NaN 0.746304 NaN 0.513372 0.818879 0.112012
1685 Somaliland region None Africa Eastern Africa 2010 4.657363 NaN 0.829005 NaN 0.820182 NaN 0.471094 0.769375 0.083426
1686 Somaliland region None Africa Eastern Africa 2011 4.930572 NaN 0.787962 NaN 0.858104 NaN 0.357341 0.748686 0.122244
1687 Somaliland region None Africa Eastern Africa 2012 5.057314 NaN 0.786291 NaN 0.758219 NaN 0.333832 0.735189 0.152428
1720 South Sudan None Africa Northern Africa 2014 3.831992 NaN 0.545118 49.840000 0.567259 NaN 0.741541 0.614024 0.428320
1721 South Sudan None Africa Northern Africa 2015 4.070771 NaN 0.584781 50.200001 0.511631 NaN 0.709606 0.586278 0.449795
1722 South Sudan None Africa Northern Africa 2016 2.888112 NaN 0.532152 50.599998 0.439919 NaN 0.785318 0.614771 0.549257
1723 South Sudan None Africa Northern Africa 2017 2.816622 NaN 0.556823 51.000000 0.456011 NaN 0.761270 0.585602 0.517364
In [69]:
df[df['Social support'].isna()]
Out[69]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
1414 Oman None Asia Western Asia 2011 6.852982 10.382462 NaN 65.5 0.916293 0.024908 NaN NaN 0.295164
In [70]:
df[df['Healthy life expectancy'].isna()]
Out[70]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
751 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2006 5.511187 10.746425 0.812178 NaN 0.909820 0.155567 0.355985 0.723260 0.235955
752 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2008 5.137262 10.815545 0.840222 NaN 0.922211 0.296268 0.273945 0.718972 0.236634
753 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2009 5.397056 10.788494 0.834716 NaN 0.918026 0.307638 0.272125 0.762151 0.210104
754 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2010 5.642835 10.846634 0.857314 NaN 0.890418 0.331955 0.255775 0.710370 0.183106
755 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2011 5.474011 10.886932 0.846060 NaN 0.894330 0.234555 0.244887 0.733887 0.195712
756 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2012 5.483765 10.892753 0.826426 NaN 0.879752 0.222402 0.379783 0.715137 0.183349
757 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2014 5.458051 10.939503 0.833558 NaN 0.843082 0.223799 0.422960 0.683968 0.242868
758 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2016 5.498421 10.969857 0.832078 NaN 0.799743 0.100235 0.402813 0.664093 0.213115
759 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2017 5.362475 10.999584 0.831066 NaN 0.830657 0.140063 0.415810 0.639533 0.200593
760 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2019 5.659317 11.000313 0.855826 NaN 0.726852 0.067344 0.431974 0.599320 0.357607
761 Hong Kong S.A.R. of China East Asia Asia Eastern Asia 2020 5.295341 10.898759 0.812943 NaN 0.705452 0.195197 0.380351 0.608647 0.210314
973 Kosovo Central and Eastern Europe Europe Southern Europe 2007 5.103906 8.927753 0.847812 NaN 0.381364 0.143901 0.894462 0.654866 0.236699
974 Kosovo Central and Eastern Europe Europe Southern Europe 2008 5.521660 8.980872 0.883843 NaN 0.664265 0.090464 0.849059 0.693481 0.317828
975 Kosovo Central and Eastern Europe Europe Southern Europe 2009 5.891433 9.008162 0.830427 NaN 0.506415 0.200504 0.967839 0.597583 0.168830
976 Kosovo Central and Eastern Europe Europe Southern Europe 2010 5.176601 9.032693 0.707959 NaN 0.451444 0.169696 0.967272 0.695178 0.117717
977 Kosovo Central and Eastern Europe Europe Southern Europe 2011 4.859502 9.066925 0.759102 NaN 0.588979 0.003699 0.919212 0.695966 0.124438
978 Kosovo Central and Eastern Europe Europe Southern Europe 2012 5.639588 9.085688 0.757147 NaN 0.635793 0.027182 0.949651 0.595572 0.099630
979 Kosovo Central and Eastern Europe Europe Southern Europe 2013 6.125758 9.113430 0.720750 NaN 0.568463 0.114904 0.935095 0.691511 0.202731
980 Kosovo Central and Eastern Europe Europe Southern Europe 2014 5.000375 9.128522 0.705632 NaN 0.441391 0.012095 0.775201 0.636128 0.205950
981 Kosovo Central and Eastern Europe Europe Southern Europe 2015 5.077461 9.182307 0.805271 NaN 0.561048 0.180851 0.850647 0.753090 0.179989
982 Kosovo Central and Eastern Europe Europe Southern Europe 2016 5.759412 9.228177 0.823803 NaN 0.827399 0.124869 0.940898 0.703887 0.149607
983 Kosovo Central and Eastern Europe Europe Southern Europe 2017 6.149200 9.262030 0.792087 NaN 0.857677 0.117175 0.925192 0.738436 0.185879
984 Kosovo Central and Eastern Europe Europe Southern Europe 2018 6.391826 9.296085 0.822407 NaN 0.889737 0.268795 0.922078 0.778271 0.170248
985 Kosovo Central and Eastern Europe Europe Southern Europe 2019 6.425144 9.338535 0.842511 NaN 0.841190 0.246990 0.920297 0.748522 0.140792
986 Kosovo Central and Eastern Europe Europe Southern Europe 2020 6.294414 9.140673 0.792374 NaN 0.879838 0.139896 0.909894 0.726240 0.201458
1381 North Cyprus Western Europe Asia Western Asia 2012 5.463305 NaN 0.871150 NaN 0.692568 NaN 0.854730 0.709236 0.405435
1382 North Cyprus Western Europe Asia Western Asia 2013 5.566803 NaN 0.869274 NaN 0.775383 NaN 0.715356 0.621554 0.442972
1383 North Cyprus Western Europe Asia Western Asia 2014 5.785979 NaN 0.801802 NaN 0.829677 NaN 0.692221 0.723842 0.311336
1384 North Cyprus Western Europe Asia Western Asia 2015 5.842550 NaN 0.791383 NaN 0.785353 NaN 0.659180 0.701609 0.318930
1385 North Cyprus Western Europe Asia Western Asia 2016 5.827128 NaN 0.807690 NaN 0.796234 NaN 0.670191 0.643664 0.346465
1386 North Cyprus Western Europe Asia Western Asia 2018 5.608056 NaN 0.837392 NaN 0.797066 NaN 0.613837 0.480453 0.261868
1387 North Cyprus Western Europe Asia Western Asia 2019 5.466615 NaN 0.803295 NaN 0.792735 NaN 0.640059 0.493693 0.296411
1684 Somaliland region None Africa Eastern Africa 2009 4.991400 NaN 0.879567 NaN 0.746304 NaN 0.513372 0.818879 0.112012
1685 Somaliland region None Africa Eastern Africa 2010 4.657363 NaN 0.829005 NaN 0.820182 NaN 0.471094 0.769375 0.083426
1686 Somaliland region None Africa Eastern Africa 2011 4.930572 NaN 0.787962 NaN 0.858104 NaN 0.357341 0.748686 0.122244
1687 Somaliland region None Africa Eastern Africa 2012 5.057314 NaN 0.786291 NaN 0.758219 NaN 0.333832 0.735189 0.152428
In [71]:
df[df['Generosity'].isna()]
Out[71]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
454 Cuba None Latin America and the Caribbean Caribbean 2006 5.417869 NaN 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602
1381 North Cyprus Western Europe Asia Western Asia 2012 5.463305 NaN 0.871150 NaN 0.692568 NaN 0.854730 0.709236 0.405435
1382 North Cyprus Western Europe Asia Western Asia 2013 5.566803 NaN 0.869274 NaN 0.775383 NaN 0.715356 0.621554 0.442972
1383 North Cyprus Western Europe Asia Western Asia 2014 5.785979 NaN 0.801802 NaN 0.829677 NaN 0.692221 0.723842 0.311336
1384 North Cyprus Western Europe Asia Western Asia 2015 5.842550 NaN 0.791383 NaN 0.785353 NaN 0.659180 0.701609 0.318930
1385 North Cyprus Western Europe Asia Western Asia 2016 5.827128 NaN 0.807690 NaN 0.796234 NaN 0.670191 0.643664 0.346465
1386 North Cyprus Western Europe Asia Western Asia 2018 5.608056 NaN 0.837392 NaN 0.797066 NaN 0.613837 0.480453 0.261868
1387 North Cyprus Western Europe Asia Western Asia 2019 5.466615 NaN 0.803295 NaN 0.792735 NaN 0.640059 0.493693 0.296411
1681 Somalia None Africa Eastern Africa 2014 5.528273 NaN 0.610836 49.599998 0.873879 NaN 0.456470 0.834454 0.207215
1682 Somalia None Africa Eastern Africa 2015 5.353645 NaN 0.599281 50.099998 0.967869 NaN 0.410236 0.900668 0.186736
1683 Somalia None Africa Eastern Africa 2016 4.667941 NaN 0.594417 50.000000 0.917323 NaN 0.440802 0.891423 0.193282
1684 Somaliland region None Africa Eastern Africa 2009 4.991400 NaN 0.879567 NaN 0.746304 NaN 0.513372 0.818879 0.112012
1685 Somaliland region None Africa Eastern Africa 2010 4.657363 NaN 0.829005 NaN 0.820182 NaN 0.471094 0.769375 0.083426
1686 Somaliland region None Africa Eastern Africa 2011 4.930572 NaN 0.787962 NaN 0.858104 NaN 0.357341 0.748686 0.122244
1687 Somaliland region None Africa Eastern Africa 2012 5.057314 NaN 0.786291 NaN 0.758219 NaN 0.333832 0.735189 0.152428
1720 South Sudan None Africa Northern Africa 2014 3.831992 NaN 0.545118 49.840000 0.567259 NaN 0.741541 0.614024 0.428320
1721 South Sudan None Africa Northern Africa 2015 4.070771 NaN 0.584781 50.200001 0.511631 NaN 0.709606 0.586278 0.449795
1722 South Sudan None Africa Northern Africa 2016 2.888112 NaN 0.532152 50.599998 0.439919 NaN 0.785318 0.614771 0.549257
1723 South Sudan None Africa Northern Africa 2017 2.816622 NaN 0.556823 51.000000 0.456011 NaN 0.761270 0.585602 0.517364
In [72]:
df[df['Perceptions of corruption'].isna()]
Out[72]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
367 China East Asia Asia Eastern Asia 2006 4.560495 8.696120 0.747011 66.879997 0.851083 -0.169039 NaN 0.809295 0.169580
368 China East Asia Asia Eastern Asia 2007 4.862862 8.823954 0.810852 67.059998 0.851083 -0.176243 NaN 0.817485 0.158614
369 China East Asia Asia Eastern Asia 2008 4.846295 8.910992 0.748287 67.239998 0.853072 -0.092472 NaN 0.817443 0.146963
370 China East Asia Asia Eastern Asia 2009 4.454361 8.995857 0.798034 67.419998 0.771143 -0.160481 NaN 0.785806 0.161650
371 China East Asia Asia Eastern Asia 2010 4.652737 9.092104 0.767753 67.599998 0.804794 -0.133318 NaN 0.765265 0.158100
372 China East Asia Asia Eastern Asia 2011 5.037208 9.178532 0.787171 67.760002 0.824162 -0.186383 NaN 0.820074 0.133503
373 China East Asia Asia Eastern Asia 2012 5.094917 9.249320 0.787818 67.919998 0.808255 -0.184676 NaN 0.820785 0.158703
374 China East Asia Asia Eastern Asia 2013 5.241090 9.319200 0.777896 68.080002 0.804724 -0.157777 NaN 0.836431 0.142211
375 China East Asia Asia Eastern Asia 2014 5.195619 9.385755 0.820366 68.239998 0.851083 -0.216772 NaN 0.853975 0.111518
376 China East Asia Asia Eastern Asia 2015 5.303878 9.448723 0.793734 68.400002 0.851083 -0.244435 NaN 0.808911 0.171315
377 China East Asia Asia Eastern Asia 2016 5.324956 9.509552 0.741703 68.699997 0.851083 -0.227522 NaN 0.826144 0.145625
378 China East Asia Asia Eastern Asia 2017 5.099061 9.571116 0.772033 69.000000 0.877618 -0.174832 NaN 0.821097 0.214005
379 China East Asia Asia Eastern Asia 2018 5.131434 9.631892 0.787605 69.300003 0.895378 -0.158510 NaN 0.855784 0.189640
380 China East Asia Asia Eastern Asia 2019 5.144120 9.687612 0.821936 69.599998 0.927356 -0.173036 NaN 0.890780 0.146512
381 China East Asia Asia Eastern Asia 2020 5.771065 9.701755 0.808334 69.900002 0.891123 -0.103214 NaN 0.789345 0.244918
454 Cuba None Latin America and the Caribbean Caribbean 2006 5.417869 NaN 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602
1144 Maldives South Asia Asia Southern Asia 2018 5.197575 9.825986 0.913315 70.599998 0.854759 0.023998 NaN NaN NaN
1414 Oman None Asia Western Asia 2011 6.852982 10.382462 NaN 65.500000 0.916293 0.024908 NaN NaN 0.295164
1535 Qatar None Asia Western Asia 2010 6.849653 11.519814 0.863325 66.699997 0.898004 0.103687 NaN 0.734913 0.302685
1536 Qatar None Asia Western Asia 2011 6.591604 11.553021 0.857351 67.019997 0.904687 0.011700 NaN 0.760927 0.327790
1537 Qatar None Asia Western Asia 2012 6.611299 11.523082 0.838132 67.339996 0.924334 0.161530 NaN 0.765899 0.322181
1538 Qatar None Asia Western Asia 2015 6.374529 11.485615 0.863325 68.300003 0.898004 0.127954 NaN 0.734913 0.302685
1904 Turkmenistan Commonwealth of Independent States Asia Central Asia 2009 6.567713 8.989171 0.923846 59.439999 0.787891 -0.101684 NaN 0.780770 0.151584
1905 Turkmenistan Commonwealth of Independent States Asia Central Asia 2011 5.791755 9.181697 0.964419 60.040001 0.787891 0.018397 NaN 0.639033 0.122068
1906 Turkmenistan Commonwealth of Independent States Asia Central Asia 2012 5.463827 9.268988 0.945841 60.279999 0.785563 -0.122812 NaN 0.584448 0.116881
1907 Turkmenistan Commonwealth of Independent States Asia Central Asia 2013 5.391763 9.347593 0.845733 60.520000 0.704529 -0.071448 NaN 0.598716 0.159606
1908 Turkmenistan Commonwealth of Independent States Asia Central Asia 2014 5.787379 9.427173 0.908927 60.759998 0.804678 0.031971 NaN 0.695216 0.153950
1909 Turkmenistan Commonwealth of Independent States Asia Central Asia 2015 5.791460 9.472206 0.960158 61.000000 0.701358 0.092775 NaN 0.705348 0.301039
1910 Turkmenistan Commonwealth of Independent States Asia Central Asia 2016 5.887052 9.515066 0.929032 61.400002 0.748504 0.004624 NaN 0.636389 0.255499
1911 Turkmenistan Commonwealth of Independent States Asia Central Asia 2017 5.229149 9.561351 0.908455 61.799999 0.720399 0.066041 NaN 0.520885 0.349628
1912 Turkmenistan Commonwealth of Independent States Asia Central Asia 2018 4.620602 9.605440 0.984489 62.200001 0.857774 0.259659 NaN 0.612210 0.189025
1913 Turkmenistan Commonwealth of Independent States Asia Central Asia 2019 5.474300 9.651184 0.981502 62.599998 0.891527 0.284881 NaN 0.509915 0.183343
In [73]:
df[df['Positive affect'].isna()]
Out[73]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
1144 Maldives South Asia Asia Southern Asia 2018 5.197575 9.825986 0.913315 70.599998 0.854759 0.023998 NaN NaN NaN
1145 Maldives South Asia Asia Southern Asia 2021 5.197600 9.825986 0.913161 70.599998 0.853963 0.023998 0.82465 NaN NaN
1414 Oman None Asia Western Asia 2011 6.852982 10.382462 NaN 65.500000 0.916293 0.024908 NaN NaN 0.295164
In [74]:
df[df['Negative affect'].isna()]
Out[74]:
Country name Regional indicator Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
1144 Maldives South Asia Asia Southern Asia 2018 5.197575 9.825986 0.913315 70.599998 0.854759 0.023998 NaN NaN NaN
1145 Maldives South Asia Asia Southern Asia 2021 5.197600 9.825986 0.913161 70.599998 0.853963 0.023998 0.82465 NaN NaN



Remoção de casos singulares¶

  • Para fins de objetividade na resposta aos questionamentos levantados, por enquanto, os casos singulares serão descartados
In [75]:
df_corte = df[ (~df['Country name'].isin(casos_singulares)) & (~df['Country name'].isin(uma_amostra)) ]
In [76]:
df_corte.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2014 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Country name                         2014 non-null   object 
 1   Regional indicator                   1971 non-null   object 
 2   Regional_indicator_consultado_Major  2014 non-null   object 
 3   Regional_indicator_consultado        2014 non-null   object 
 4   year                                 2014 non-null   int64  
 5   Ladder score                         2014 non-null   float64
 6   Logged GDP per capita                2014 non-null   float64
 7   Social support                       2014 non-null   float64
 8   Healthy life expectancy              2014 non-null   float64
 9   Freedom to make life choices         2014 non-null   float64
 10  Generosity                           2014 non-null   float64
 11  Perceptions of corruption            2014 non-null   float64
 12  Positive affect                      2014 non-null   float64
 13  Negative affect                      2014 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 236.0+ KB
In [77]:
df_corte.to_csv('df_preenchido_descarte.csv')
In [78]:
ao_ano_corte = df_corte.groupby(by='year',as_index=False).agg('mean')
corte_scaler = MinMaxScaler()
sc_ano_corte = pd.DataFrame(corte_scaler.fit_transform(ao_ano_corte.drop(columns='year')),columns=ao_ano_corte.drop(columns='year').columns)
sc_ano_corte['year'] = ao_ano_corte['year']
sc_ano_corte = sc_ano_corte[[*ao_ano_corte.columns]]
In [79]:
print(f"Sobre o scaler dos dados com na:\nFeatures: {corte_scaler.feature_names_in_}\nValores Máximos: {corte_scaler.data_max_}\nValores Mínimos: {corte_scaler.data_min_}\nFaixa de valores: {corte_scaler.feature_range}\nParâmetros gerais: {corte_scaler.get_params()}")
Sobre o scaler dos dados com na:
Features: ['Ladder score' 'Logged GDP per capita' 'Social support'
 'Healthy life expectancy' 'Freedom to make life choices' 'Generosity'
 'Perceptions of corruption' 'Positive affect' 'Negative affect']
Valores Máximos: [6.44616427e+00 1.01186379e+01 8.97366864e-01 6.70925527e+01
 8.22788169e-01 1.99202307e-02 7.88631714e-01 7.43805360e-01
 2.95408072e-01]
Valores Mínimos: [ 5.19811267e+00  9.02855413e+00  7.83287293e-01  5.99727907e+01
  6.84637176e-01 -2.76569087e-02  7.03033391e-01  7.00210670e-01
  2.43803148e-01]
Faixa de valores: (0, 1)
Parâmetros gerais: {'clip': False, 'copy': True, 'feature_range': (0, 1)}

Q3 - Influência da Pandemia, com dados descartados¶

In [80]:
fig = go.Figure()
cols = [*sc_ano_corte.columns]
cols.remove('year')
for column in cols:
    fig.add_trace(go.Scatter( x = sc_ano_corte.year, y = sc_ano_corte[column], name = column, mode = 'lines') )
fig.update_layout(title = "Indicadores Escalonados Globais por Ano [com dados descartados]", xaxis_title = 'Ano')
fig.show()
fig.write_html('indicadores_global_norm_descarte.html')

Todos foram impactados pela pandemia, embora Perceptions of corruption, Generosity tenham comportamento invertido

In [81]:
br_ano_corte = df_corte[df_corte['Country name']=='Brazil'].groupby(by='year',as_index=False).agg('mean')
cols = [*br_ano_corte.columns]
cols.remove('year')
fig = go.Figure()
for column in cols:
    fig.add_trace(go.Scatter( x = br_ano_corte.year, y = br_ano_corte[column], name = column, mode = 'lines') )
fig.update_layout(title = "Indicadores por Ano [Brasil, não-normalizados, pós-descarte]", xaxis_title = 'Ano')
fig.show()
fig.write_html('indicadores_br_raw_descarte.html')
  • Não há mundaças no comportamento geral após as manipulações



Q2 - Estimativa de Region com base nos atributos¶

  • Regional_indicator_consultado_Major e Regional_indicator_consultado serão removidos dessa análise
  • Prever a região (Regional_indicator) é uma tarefa de classificação
In [82]:
df1 = pd.read_csv('df_preenchido_descarte.csv', index_col=0) # dataset com dados excluídos
In [83]:
df1.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)
aux_columns = df1[['Regional_indicator_consultado_Major','Regional_indicator_consultado']]
df1.drop(columns=['Regional_indicator_consultado_Major','Regional_indicator_consultado'], inplace=True) # removendo colunas auxiliares
df1
Out[83]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Afghanistan South Asia 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 0.258195
1 Afghanistan South Asia 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 0.237092
2 Afghanistan South Asia 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 0.275324
3 Afghanistan South Asia 2011 3.831719 7.619532 0.521104 51.919998 0.495901 0.162427 0.731109 0.611387 0.267175
4 Afghanistan South Asia 2012 3.782938 7.705479 0.520637 52.240002 0.530935 0.236032 0.775620 0.710385 0.267919
... ... ... ... ... ... ... ... ... ... ... ... ...
2009 Zimbabwe Sub-Saharan Africa 2017 3.638300 8.015738 0.754147 55.000000 0.752826 -0.097645 0.751208 0.806428 0.224051
2010 Zimbabwe Sub-Saharan Africa 2018 3.616480 8.048798 0.775388 55.599998 0.762675 -0.068427 0.844209 0.710119 0.211726
2011 Zimbabwe Sub-Saharan Africa 2019 2.693523 7.950132 0.759162 56.200001 0.631908 -0.063791 0.830652 0.716004 0.235354
2012 Zimbabwe Sub-Saharan Africa 2020 3.159802 7.828757 0.717243 56.799999 0.643303 -0.008696 0.788523 0.702573 0.345736
2013 Zimbabwe Sub-Saharan Africa 2021 3.144800 7.942595 0.750470 56.200840 0.676700 -0.047346 0.820999 0.717712 0.224420

2014 rows × 12 columns

In [84]:
df1['Regional indicator'].value_counts()
Out[84]:
Sub-Saharan Africa                    426
Latin America and Caribbean           299
Western Europe                        284
Middle East and North Africa          228
Central and Eastern Europe            227
Commonwealth of Independent States    171
Southeast Asia                        125
South Asia                             89
North America and ANZ                  62
East Asia                              60
Name: Regional indicator, dtype: int64
  • Regional indicator é uma classe desbalanceada



Tratamento pré-modelo¶

  • Serão aplicados o Random Forests e o CatBoost
  • Para o random forests não podemos usar atributos categóricos, então é necessário realizar o encoding destes atributos
  • O CatBoost consegue processar dados numéricos e categóricos, não se fazendo necessário codificar estes

Criação dos datasets de treino, teste e previsão¶

In [85]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2014 entries, 0 to 2013
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  2014 non-null   object 
 1   Regional indicator            1971 non-null   object 
 2   year                          2014 non-null   int64  
 3   Ladder score                  2014 non-null   float64
 4   Logged GDP per capita         2014 non-null   float64
 5   Social support                2014 non-null   float64
 6   Healthy life expectancy       2014 non-null   float64
 7   Freedom to make life choices  2014 non-null   float64
 8   Generosity                    2014 non-null   float64
 9   Perceptions of corruption     2014 non-null   float64
 10  Positive affect               2014 non-null   float64
 11  Negative affect               2014 non-null   float64
dtypes: float64(9), int64(1), object(2)
memory usage: 188.9+ KB
In [86]:
colunas_originais = [*df1.columns]
sem_tgt_paises = [*df1.columns] 
sem_tgt_paises.remove('Country name')
sem_tgt_paises.remove('Regional indicator') # colunas do dataset original sem target e sem paises
  • Para transformar uma tributo categórico em numérico sem que se passe a falsa ideia de que há uma relação cardinal entre os labels do atributo, é necessário que se faça uma "dummyficação"
  • Como para o Catboost isso não se faz necessário, será criado um dataframe separado para a aplicação de cada modelo

Dummy encoding¶

  • Dummy encoding para o Random Forest e criação do segundo dataframe para aplicar o Catboost
In [87]:
countries = df1['Country name']
Descartados_profile = pr(df1, title="Profile Report com dados descartados", explorative=True, progress_bar=False)
Descartados_profile.to_file(f"profile_com_descartados.html")
df2 = df1.copy()
df1 = pd.get_dummies(df1, columns=['Country name'], prefix='', prefix_sep='', sparse=False, dtype=bool)
In [88]:
df_target = df1[df1['Regional indicator'].isna()] # para o random forests
df1 = df1[~df1['Regional indicator'].isna()] # para o random forests

df_target2 = df2[df2['Regional indicator'].isna()] # para o catboost
df2 = df2[~df2['Regional indicator'].isna()] # para o catboost
target_countries = countries[countries.index.isin(df_target.index)]
In [89]:
df2.head(3)
Out[89]:
Country name Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Afghanistan South Asia 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 0.258195
1 Afghanistan South Asia 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 0.237092
2 Afghanistan South Asia 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 0.275324
In [90]:
df1.head(3)
Out[90]:
Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect ... United Arab Emirates United Kingdom United States Uruguay Uzbekistan Venezuela Vietnam Yemen Zambia Zimbabwe
0 South Asia 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 ... False False False False False False False False False False
1 South Asia 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 ... False False False False False False False False False False
2 South Asia 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 ... False False False False False False False False False False

3 rows × 163 columns

In [91]:
df_target.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 36 to 1801
Columns: 163 entries, Regional indicator to Zimbabwe
dtypes: bool(152), float64(9), int64(1), object(1)
memory usage: 10.4+ KB
In [92]:
df_target2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 36 to 1801
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  43 non-null     object 
 1   Regional indicator            0 non-null      object 
 2   year                          43 non-null     int64  
 3   Ladder score                  43 non-null     float64
 4   Logged GDP per capita         43 non-null     float64
 5   Social support                43 non-null     float64
 6   Healthy life expectancy       43 non-null     float64
 7   Freedom to make life choices  43 non-null     float64
 8   Generosity                    43 non-null     float64
 9   Perceptions of corruption     43 non-null     float64
 10  Positive affect               43 non-null     float64
 11  Negative affect               43 non-null     float64
dtypes: float64(9), int64(1), object(2)
memory usage: 4.4+ KB



Oversampling com SMOTE¶

In [93]:
y = df1['Regional indicator']
y2 = df2['Regional indicator']
SMOTE RF¶
In [94]:
sm = SMOTE(random_state=42, n_jobs=-1)
X_res, y_res = sm.fit_resample(df1.drop(columns=['Regional indicator']), df1['Regional indicator'])
In [95]:
print(f"antes: {df1.shape[0]}, depois: {X_res.shape[0]}")
antes: 1971, depois: 4260
In [96]:
y_res.value_counts()
Out[96]:
South Asia                            426
Central and Eastern Europe            426
Middle East and North Africa          426
Latin America and Caribbean           426
Commonwealth of Independent States    426
North America and ANZ                 426
Western Europe                        426
Sub-Saharan Africa                    426
Southeast Asia                        426
East Asia                             426
Name: Regional indicator, dtype: int64
SMOTE CatBoost¶
In [97]:
smnc = SMOTENC(random_state=42, n_jobs=-1, categorical_features=[0])
X_res2, y_res2 = smnc.fit_resample(df2.drop(columns=['Regional indicator']), df2['Regional indicator'])
In [98]:
print(f"antes: {df2.shape[0]}, depois: {X_res2.shape[0]}")
antes: 1971, depois: 4260
In [99]:
y_res2.value_counts()
Out[99]:
South Asia                            426
Central and Eastern Europe            426
Middle East and North Africa          426
Latin America and Caribbean           426
Commonwealth of Independent States    426
North America and ANZ                 426
Western Europe                        426
Sub-Saharan Africa                    426
Southeast Asia                        426
East Asia                             426
Name: Regional indicator, dtype: int64
  • Balanceado

RF¶

Treino¶

In [100]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.15, random_state=42)
In [101]:
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)
Out[101]:
RandomForestClassifier(max_depth=2, random_state=0)
In [102]:
clf.score(X_test,y_test)
Out[102]:
0.7449139280125195
In [103]:
y_tgt = clf.predict( df_target.drop(columns=['Regional indicator']) )
In [104]:
df_target['Country name'] = target_countries
df_target['região estimada'] = y_tgt
df_target = df_target[['Country name','região estimada','Regional indicator', 'year', 'Ladder score',
                       'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
                       'Freedom to make life choices', 'Generosity', 'Perceptions of corruption',
                       'Positive affect', 'Negative affect']]
In [105]:
df_target.head(5)
Out[105]:
Country name região estimada Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
36 Angola Sub-Saharan Africa NaN 2011 5.589001 8.945782 0.723094 52.500000 0.583702 0.055257 0.911320 0.658647 0.361063
37 Angola Sub-Saharan Africa NaN 2012 4.360250 8.991773 0.752593 53.200001 0.456029 -0.136070 0.906300 0.557908 0.304890
38 Angola Sub-Saharan Africa NaN 2013 3.937107 9.004611 0.721591 53.900002 0.409555 -0.103557 0.816375 0.658284 0.370875
39 Angola Sub-Saharan Africa NaN 2014 3.794838 9.016735 0.754615 54.599998 0.374542 -0.167723 0.834076 0.578517 0.367864
173 Belize Latin America and Caribbean NaN 2007 6.450644 8.892479 0.872267 61.599998 0.705306 0.032754 0.768984 0.758783 0.250596

Resultados¶

In [106]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
print(f"ROC AUC: {round(roc_auc_score(y_test, clf.predict_proba(X_test), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test, clf.predict(X_test), average='weighted'),4)}")
print(classification_report(y_test, clf.predict(X_test)))#, target_names=labels))
ROC AUC: 0.9496
Recall: 0.7449
                                    precision    recall  f1-score   support

        Central and Eastern Europe       0.91      0.68      0.78        63
Commonwealth of Independent States       0.95      0.78      0.86        78
                         East Asia       0.86      1.00      0.93        64
       Latin America and Caribbean       0.82      0.78      0.80        60
      Middle East and North Africa       0.72      0.58      0.64        59
             North America and ANZ       0.57      1.00      0.73        66
                        South Asia       0.68      0.74      0.71        61
                    Southeast Asia       0.81      0.84      0.83        57
                Sub-Saharan Africa       0.72      0.81      0.76        67
                    Western Europe       0.40      0.22      0.28        64

                          accuracy                           0.74       639
                         macro avg       0.75      0.74      0.73       639
                      weighted avg       0.75      0.74      0.73       639

  • Apesar da aplicação do oversampling, o modelo ainda tem um desempenho muito ruim quanto a algumas classes
In [107]:
df1['Regional indicator'] = y_res
In [108]:
df1.to_csv('descarte_balanceado_rf.csv')
In [109]:
res1 = df_target[['Country name','região estimada']]
res1 = pd.concat([res1,aux_columns[aux_columns.index.isin(df_target.index)]], axis=1)
res1
Out[109]:
Country name região estimada Regional_indicator_consultado_Major Regional_indicator_consultado
36 Angola Sub-Saharan Africa Africa Middle Africa
37 Angola Sub-Saharan Africa Africa Middle Africa
38 Angola Sub-Saharan Africa Africa Middle Africa
39 Angola Sub-Saharan Africa Africa Middle Africa
173 Belize Latin America and Caribbean Latin America and the Caribbean Central America
174 Belize Latin America and Caribbean Latin America and the Caribbean Central America
188 Bhutan Southeast Asia Asia Southern Asia
189 Bhutan Southeast Asia Asia Southern Asia
190 Bhutan Southeast Asia Asia Southern Asia
331 Central African Republic Sub-Saharan Africa Africa Middle Africa
332 Central African Republic Sub-Saharan Africa Africa Middle Africa
333 Central African Republic Sub-Saharan Africa Africa Middle Africa
334 Central African Republic Sub-Saharan Africa Africa Middle Africa
335 Central African Republic Sub-Saharan Africa Africa Middle Africa
401 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
402 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
403 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
404 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
405 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
406 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
407 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
408 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
481 Djibouti Sub-Saharan Africa Africa Eastern Africa
482 Djibouti Sub-Saharan Africa Africa Eastern Africa
483 Djibouti Sub-Saharan Africa Africa Eastern Africa
484 Djibouti Sub-Saharan Africa Africa Eastern Africa
1682 Sudan Sub-Saharan Africa Africa Northern Africa
1683 Sudan Sub-Saharan Africa Africa Northern Africa
1684 Sudan Sub-Saharan Africa Africa Northern Africa
1685 Sudan Sub-Saharan Africa Africa Northern Africa
1686 Sudan Sub-Saharan Africa Africa Northern Africa
1718 Syria South Asia Asia Western Asia
1719 Syria Sub-Saharan Africa Asia Western Asia
1720 Syria Central and Eastern Europe Asia Western Asia
1721 Syria South Asia Asia Western Asia
1722 Syria South Asia Asia Western Asia
1723 Syria South Asia Asia Western Asia
1724 Syria Sub-Saharan Africa Asia Western Asia
1797 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
1798 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
1799 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
1800 Trinidad and Tobago Southeast Asia Latin America and the Caribbean Caribbean
1801 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
  • Não há um gabarito pois as regiões para estas amostras não haviam sido fornecidas anteriormente, mas se compararmos com os dados da ONU vemos que o modelo tem um bom desempenho exceto quando os países se encontram entre o Oriente Médio e o Oeste da Ásia
  • Vale lembrar que o dataset usa uma definição de regiões diferente da ONU, e muitas das regiões definidas pela ONU não estão presentes no dataset original


Exemplo de Caso de confusão¶
  • Se observarmos o caso peculiar onde o modelo classificou a Síria como pertencente à Europa Centro-Oriental, veremos que os indicadores são bem smelhantes
In [110]:
df_target[ (df_target['Country name'] == 'Syria') & (df_target['região estimada'] == 'Central and Eastern Europe') ]
Out[110]:
Country name região estimada Regional indicator year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
1720 Syria Central and Eastern Europe NaN 2010 4.464708 8.729084 0.934232 64.099998 0.647048 0.007883 0.743094 0.557652 0.224644
In [111]:
df1[ (df1['year'] == 2010) & (df1['Regional indicator'] == 'Central and Eastern Europe') ][['Regional indicator', 'year', 'Ladder score','Logged GDP per capita', 'Social support',
                                                                                            'Healthy life expectancy','Freedom to make life choices', 'Generosity', 'Perceptions of corruption',
                                                                                            'Positive affect', 'Negative affect']].groupby(by='Regional indicator').agg('mean')
Out[111]:
year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
Regional indicator
Central and Eastern Europe 2010.0 5.142485 8.820354 0.801815 59.317647 0.688035 -0.033074 0.782977 0.669645 0.246042
In [112]:
joblib.dump(clf, 'rf0.joblib')

df_target.to_csv('rf_resultados_tabela.csv')
res1.to_csv('rf_comp_direta.csv')


Features¶

In [113]:
fi = pd.DataFrame({'features':X_train.columns,'importances':clf.feature_importances_})
fi.sort_values(by='importances', ascending=False, inplace=True)
fi.index = range(fi.shape[0])
sem_imp = fi[fi.importances == 0]
fi = fi[fi.importances > 0]
fi[fi.importances > 0.01]
Out[113]:
features importances
0 Logged GDP per capita 0.098954
1 Ladder score 0.089725
2 Healthy life expectancy 0.086614
3 Positive affect 0.079099
4 Social support 0.076514
5 Perceptions of corruption 0.057689
6 Generosity 0.047708
7 Mongolia 0.038714
8 Australia 0.037656
9 Canada 0.035235
10 Taiwan Province of China 0.033808
11 Freedom to make life choices 0.031993
12 Japan 0.021933
13 Indonesia 0.021338
14 Thailand 0.020518
15 Negative affect 0.018886
16 United States 0.018211
17 Pakistan 0.016323
18 New Zealand 0.016167
19 South Korea 0.013238
20 Moldova 0.010937
21 Nepal 0.010531
Features sem importância¶
In [114]:
print(f"Atributos sem importância: {len(sem_imp.features.values)}\n\n{sem_imp.features.values}")
Atributos sem importância: 105

['Zambia' 'Morocco' 'Malta' 'Nicaragua' 'Uruguay' 'Mauritania'
 'Uzbekistan' 'Venezuela' 'Montenegro' 'Yemen' 'Mauritius' 'Mozambique'
 'Niger' 'Mexico' 'Namibia' 'Sri Lanka' 'Nigeria' 'Uganda' 'Sudan'
 'South Africa' 'Mali' 'Swaziland' 'Sierra Leone' 'Serbia' 'Senegal'
 'Russia' 'Poland' 'Switzerland' 'Syria' 'Tanzania' 'Togo' 'Paraguay'
 'Panama' 'Trinidad and Tobago' 'Norway' 'Spain' 'Tunisia'
 'North Macedonia' 'Kenya' 'Malawi' 'Burkina Faso' 'Denmark'
 'Czech Republic' 'Cyprus' 'Croatia' 'Costa Rica' 'Congo (Kinshasa)'
 'Congo (Brazzaville)' 'Comoros' 'Colombia' 'Chile' 'Chad'
 'Central African Republic' 'Cameroon' 'Burundi' 'Bulgaria' 'Madagascar'
 'Brazil' 'Botswana' 'Bosnia and Herzegovina' 'Bolivia' 'Bhutan' 'Benin'
 'Belize' 'Belgium' 'Bahrain' 'Armenia' 'Argentina' 'Angola' 'Algeria'
 'Afghanistan' 'Djibouti' 'Dominican Republic' 'Ecuador' 'Egypt'
 'Luxembourg' 'Lithuania' 'Libya' 'Liberia' 'Lesotho' 'Lebanon' 'Latvia'
 'Laos' 'Kuwait' 'Jordan' 'Jamaica' 'Ivory Coast' 'Italy' 'Ireland' 'Iraq'
 'Iceland' 'Hungary' 'Honduras' 'Haiti' 'Guinea' 'Guatemala' 'Greece'
 'Ghana' 'Gambia' 'Gabon' 'France' 'Finland' 'Ethiopia' 'Estonia'
 'Zimbabwe']
In [115]:
fi[(fi.features.isin(sem_tgt_paises))]
Out[115]:
features importances
0 Logged GDP per capita 0.098954
1 Ladder score 0.089725
2 Healthy life expectancy 0.086614
3 Positive affect 0.079099
4 Social support 0.076514
5 Perceptions of corruption 0.057689
6 Generosity 0.047708
11 Freedom to make life choices 0.031993
15 Negative affect 0.018886
44 year 0.000922
Features de nomes de países¶
In [116]:
features_paises = fi[~fi.features.isin(sem_tgt_paises)]
print(f"Total de \"features de países\": {features_paises.shape[0]}\nTotal de importância: {features_paises.importances.sum()}")
Total de "features de países": 47
Total de importância: 0.41189564926463523
  • A importância somada dos países é muito alta se comparada com os demais atributos
In [117]:
mf_sem_paises_rf = [*fi[(fi.features.isin(sem_tgt_paises))].features] # melhores features desconsiderando os países
com_paises_rf = [*fi[fi.importances >= 0.0005].features] # desconsiderando os países

Re-treino (comportamento sem os nomes dos países)¶

In [118]:
X_train4, X_test4, y_train4, y_test4 = train_test_split(X_res[mf_sem_paises_rf], y_res, test_size=0.15, random_state=42)
In [119]:
clf2 = RandomForestClassifier(max_depth=2, random_state=0)
clf2.fit(X_train4, y_train4)
Out[119]:
RandomForestClassifier(max_depth=2, random_state=0)
In [120]:
clf2.score(X_test4,y_test4)
Out[120]:
0.5774647887323944
  • Retirar os nomes dos países do modelo faz a acurácia cair muito, conforme esperado

Resultados re-treino¶

In [121]:
print(f"ROC AUC: {round(roc_auc_score(y_test4, clf2.predict_proba(X_test4), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test4, clf2.predict(X_test4), average='weighted'),4)}")
print(classification_report(y_test4, clf2.predict(X_test4)))#, target_names=labels))
ROC AUC: 0.9142
Recall: 0.5775
                                    precision    recall  f1-score   support

        Central and Eastern Europe       0.48      0.54      0.51        63
Commonwealth of Independent States       0.56      0.45      0.50        78
                         East Asia       0.50      0.77      0.60        64
       Latin America and Caribbean       0.88      0.70      0.78        60
      Middle East and North Africa       0.53      0.29      0.37        59
             North America and ANZ       0.60      1.00      0.75        66
                        South Asia       0.69      0.36      0.47        61
                    Southeast Asia       0.67      0.56      0.61        57
                Sub-Saharan Africa       0.54      0.94      0.69        67
                    Western Europe       0.41      0.14      0.21        64

                          accuracy                           0.58       639
                         macro avg       0.59      0.57      0.55       639
                      weighted avg       0.58      0.58      0.55       639

  • Apesar da queda geral no desempenho do modelo, ainda há regiões onde ele se sai bem
In [ ]:
 

CatBoost¶

Treino¶

In [122]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_res2, y_res2, test_size=0.35, random_state=42)
In [123]:
cb = CatBoostClassifier(
    custom_loss=[metrics.AUCMulticlass()],
    random_seed=42,
    logging_level='Silent',
    iterations=150
)
In [124]:
cb.fit(
    X_train2, y_train2,
    cat_features=[0],
    eval_set=(X_test2, y_test2),
    # logging_level='Verbose',
    plot=True
);
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
In [125]:
cb.score(X_test2,y_test2)
Out[125]:
1.0
In [126]:
cb.best_score_
Out[126]:
{'learn': {'MultiClass': 0.010164689602621865},
 'validation': {'AUC:type=Mu': 1.0, 'MultiClass': 0.0038862307858915017}}

Resultados¶

In [127]:
print(f"ROC AUC: {round(roc_auc_score(y_test2, cb.predict_proba(X_test2), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test2, cb.predict(X_test2), average='weighted'),4)}")
print(classification_report(y_test2, cb.predict(X_test2)))#, target_names=labels))
ROC AUC: 1.0
Recall: 1.0
                                    precision    recall  f1-score   support

        Central and Eastern Europe       1.00      1.00      1.00       148
Commonwealth of Independent States       1.00      1.00      1.00       158
                         East Asia       1.00      1.00      1.00       153
       Latin America and Caribbean       1.00      1.00      1.00       125
      Middle East and North Africa       1.00      1.00      1.00       149
             North America and ANZ       1.00      1.00      1.00       150
                        South Asia       1.00      1.00      1.00       142
                    Southeast Asia       1.00      1.00      1.00       156
                Sub-Saharan Africa       1.00      1.00      1.00       162
                    Western Europe       1.00      1.00      1.00       148

                          accuracy                           1.00      1491
                         macro avg       1.00      1.00      1.00      1491
                      weighted avg       1.00      1.00      1.00      1491

  • Claros sinais de overfitting
In [128]:
predictions = cb.predict(df_target2.drop(columns=['Regional indicator']))
predictions_probs = cb.predict_proba(df_target2.drop(columns=['Regional indicator']))
df_target2['região estimada'] = predictions
In [129]:
res2 = df_target2[['Country name','região estimada']]
res2 = pd.concat([res2,aux_columns[aux_columns.index.isin(df_target.index)]], axis=1)
res2
Out[129]:
Country name região estimada Regional_indicator_consultado_Major Regional_indicator_consultado
36 Angola Sub-Saharan Africa Africa Middle Africa
37 Angola Sub-Saharan Africa Africa Middle Africa
38 Angola Sub-Saharan Africa Africa Middle Africa
39 Angola Sub-Saharan Africa Africa Middle Africa
173 Belize Latin America and Caribbean Latin America and the Caribbean Central America
174 Belize Sub-Saharan Africa Latin America and the Caribbean Central America
188 Bhutan Sub-Saharan Africa Asia Southern Asia
189 Bhutan Sub-Saharan Africa Asia Southern Asia
190 Bhutan Sub-Saharan Africa Asia Southern Asia
331 Central African Republic Sub-Saharan Africa Africa Middle Africa
332 Central African Republic Sub-Saharan Africa Africa Middle Africa
333 Central African Republic Sub-Saharan Africa Africa Middle Africa
334 Central African Republic Sub-Saharan Africa Africa Middle Africa
335 Central African Republic Sub-Saharan Africa Africa Middle Africa
401 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
402 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
403 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
404 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
405 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
406 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
407 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
408 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
481 Djibouti Sub-Saharan Africa Africa Eastern Africa
482 Djibouti Sub-Saharan Africa Africa Eastern Africa
483 Djibouti Sub-Saharan Africa Africa Eastern Africa
484 Djibouti Sub-Saharan Africa Africa Eastern Africa
1682 Sudan Sub-Saharan Africa Africa Northern Africa
1683 Sudan Sub-Saharan Africa Africa Northern Africa
1684 Sudan Sub-Saharan Africa Africa Northern Africa
1685 Sudan Sub-Saharan Africa Africa Northern Africa
1686 Sudan Sub-Saharan Africa Africa Northern Africa
1718 Syria Middle East and North Africa Asia Western Asia
1719 Syria Middle East and North Africa Asia Western Asia
1720 Syria Middle East and North Africa Asia Western Asia
1721 Syria Middle East and North Africa Asia Western Asia
1722 Syria Sub-Saharan Africa Asia Western Asia
1723 Syria Sub-Saharan Africa Asia Western Asia
1724 Syria Sub-Saharan Africa Asia Western Asia
1797 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
1798 Trinidad and Tobago Sub-Saharan Africa Latin America and the Caribbean Caribbean
1799 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
1800 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
1801 Trinidad and Tobago Latin America and Caribbean Latin America and the Caribbean Caribbean
  • Resultados questionáveis
In [130]:
df_target2.to_csv('descarte_balanceado_catboost.csv')
In [131]:
import joblib
joblib.dump(cb, 'cb0.joblib')
res2.to_csv('catboost_comp_direta.csv')

Features¶

In [132]:
fi2 = pd.DataFrame({'features':X_train2.columns,'importances':cb.feature_importances_})
fi2.sort_values(by='importances', ascending=False, inplace=True)
fi2.index = range(fi2.shape[0])
sem_imp = fi2[fi2.importances == 0]
fi2 = fi2[fi2.importances > 0]
fi2[fi2.importances > 0.01]
Out[132]:
features importances
0 Country name 58.414788
1 Logged GDP per capita 10.303839
2 Positive affect 9.234681
3 Healthy life expectancy 6.624715
4 Generosity 5.484185
5 Ladder score 4.168928
6 Negative affect 4.115216
7 Social support 0.516048
8 Perceptions of corruption 0.480882
9 Freedom to make life choices 0.464831
10 year 0.191887
In [133]:
print(f"Atributos sem importância: {len(sem_imp.features.values)}\n{sem_imp.features.values}")
Atributos sem importância: 0
[]
  • Diferente do RF, todos os atributos têm importância
  • Mais uma vez, os nomes dos países têm grande influência na decisão
In [134]:
fi2[(fi2.features.isin(sem_tgt_paises))]
Out[134]:
features importances
1 Logged GDP per capita 10.303839
2 Positive affect 9.234681
3 Healthy life expectancy 6.624715
4 Generosity 5.484185
5 Ladder score 4.168928
6 Negative affect 4.115216
7 Social support 0.516048
8 Perceptions of corruption 0.480882
9 Freedom to make life choices 0.464831
10 year 0.191887
In [135]:
mf_sem_paises_rf = [*fi2[(fi2.features.isin(sem_tgt_paises))].features] # melhores features desconsiderando os países
com_paises_rf = [*fi2[fi2.importances >= 0.0005].features] # desconsiderando os países

Re-treino¶

In [136]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(X_res2[mf_sem_paises_rf], y_res2, test_size=0.35, random_state=42)
In [137]:
cb1 = CatBoostClassifier(
    custom_loss=[metrics.AUCMulticlass()],
    random_seed=42,
    logging_level='Silent',
    iterations=10    
)
In [138]:
cb1.fit(
    X_train3, y_train3,
    cat_features=[],
    eval_set=(X_test3, y_test3),
    # logging_level='Verbose',
    plot=True
);
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
In [139]:
cb1.score(X_test3,y_test3)
Out[139]:
0.846411804158283
  • O resultado "piora"
In [140]:
cb1.best_score_
Out[140]:
{'learn': {'MultiClass': 0.4165059699330663},
 'validation': {'AUC:type=Mu': 0.9960896617876441,
  'MultiClass': 0.4972461218352189}}

Resultados do re-treino¶

In [141]:
print(f"ROC AUC: {round(roc_auc_score(y_test3, cb1.predict_proba(X_test3), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test3, cb1.predict(X_test3), average='weighted'),4)}")
print(classification_report(y_test3, cb1.predict(X_test3)))#, target_names=labels))
ROC AUC: 0.9897
Recall: 0.8464
                                    precision    recall  f1-score   support

        Central and Eastern Europe       0.79      0.78      0.79       148
Commonwealth of Independent States       0.85      0.81      0.83       158
                         East Asia       0.87      0.98      0.92       153
       Latin America and Caribbean       0.84      0.89      0.86       125
      Middle East and North Africa       0.75      0.74      0.75       149
             North America and ANZ       0.91      0.96      0.93       150
                        South Asia       0.82      0.87      0.84       142
                    Southeast Asia       0.91      0.83      0.87       156
                Sub-Saharan Africa       0.89      0.81      0.85       162
                    Western Europe       0.83      0.80      0.81       148

                          accuracy                           0.85      1491
                         macro avg       0.85      0.85      0.85      1491
                      weighted avg       0.85      0.85      0.85      1491

In [142]:
predictions = cb1.predict(df_target2.drop(columns=['Regional indicator','Country name','região estimada']))
predictions_probs = cb1.predict_proba(df_target2.drop(columns=['Regional indicator','Country name','região estimada']))
df_target2['região estimada'] = predictions
In [143]:
res3 = df_target2[['Country name','região estimada']]
res3 = pd.concat([res3,aux_columns[aux_columns.index.isin(df_target.index)]], axis=1)
res3
Out[143]:
Country name região estimada Regional_indicator_consultado_Major Regional_indicator_consultado
36 Angola South Asia Africa Middle Africa
37 Angola Middle East and North Africa Africa Middle Africa
38 Angola Middle East and North Africa Africa Middle Africa
39 Angola Middle East and North Africa Africa Middle Africa
173 Belize Latin America and Caribbean Latin America and the Caribbean Central America
174 Belize Latin America and Caribbean Latin America and the Caribbean Central America
188 Bhutan Southeast Asia Asia Southern Asia
189 Bhutan Southeast Asia Asia Southern Asia
190 Bhutan Southeast Asia Asia Southern Asia
331 Central African Republic Sub-Saharan Africa Africa Middle Africa
332 Central African Republic Sub-Saharan Africa Africa Middle Africa
333 Central African Republic Sub-Saharan Africa Africa Middle Africa
334 Central African Republic Sub-Saharan Africa Africa Middle Africa
335 Central African Republic Sub-Saharan Africa Africa Middle Africa
401 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
402 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
403 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
404 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
405 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
406 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
407 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
408 Congo (Kinshasa) Sub-Saharan Africa Africa Middle Africa
481 Djibouti Sub-Saharan Africa Africa Eastern Africa
482 Djibouti Sub-Saharan Africa Africa Eastern Africa
483 Djibouti Sub-Saharan Africa Africa Eastern Africa
484 Djibouti Sub-Saharan Africa Africa Eastern Africa
1682 Sudan Sub-Saharan Africa Africa Northern Africa
1683 Sudan Sub-Saharan Africa Africa Northern Africa
1684 Sudan Sub-Saharan Africa Africa Northern Africa
1685 Sudan Sub-Saharan Africa Africa Northern Africa
1686 Sudan Sub-Saharan Africa Africa Northern Africa
1718 Syria South Asia Asia Western Asia
1719 Syria South Asia Asia Western Asia
1720 Syria Middle East and North Africa Asia Western Asia
1721 Syria South Asia Asia Western Asia
1722 Syria South Asia Asia Western Asia
1723 Syria South Asia Asia Western Asia
1724 Syria South Asia Asia Western Asia
1797 Trinidad and Tobago Southeast Asia Latin America and the Caribbean Caribbean
1798 Trinidad and Tobago Southeast Asia Latin America and the Caribbean Caribbean
1799 Trinidad and Tobago Southeast Asia Latin America and the Caribbean Caribbean
1800 Trinidad and Tobago Southeast Asia Latin America and the Caribbean Caribbean
1801 Trinidad and Tobago Commonwealth of Independent States Latin America and the Caribbean Caribbean
  • Dimunindo bastante o número de iterações durante o treino fez com que o modelo generalizasse um pouco mais, mas parece ainda haver uma deficiência quanto à Região Latino-Americana.
  • Vale lembrar que o dataset usa uma definição de regiões diferente da ONU, e muitas das regiões definidas pela ONU não estão presentes no dataset original:
    • No dataset original, por exemplo, tudo o que seria "Eastern Africa" está classificado como "Sub-Saharan Africa", e então Djibouti estaria corretamente classificado.
In [144]:
df_target2.to_csv('descarte_balanceado_catboost_sem_paises.csv')
In [145]:
joblib.dump(cb1, 'cb1_sem_paises.joblib')
res2.to_csv('catboost_sem_paises_comp_direta.csv')





Q1 - Estimativa de Ladder score com base nos atributos¶

  • Remover Country name da análise
  • Usar Regional_indicator_consultado_Major e Regional_indicator_consultado no lugar de Regional indicator, por trazerem dados mais completos
  • É possível estimar qualquer atributo, categórico ou numérico a partir dos demais nesta base.
  • Para tarefas de classificação ou estimação de atributos categóricos, um algoritmo como o CatBoost que processa atributos categóricos e numéricos é indicado
  • Estimar o Ladder Score, ou qualquer outro atributo numérico, é uma tarefa de regressão
In [146]:
df1 = pd.read_csv('df_preenchido_descarte.csv', index_col=0) # dataset com dados excluídos
In [147]:
df1.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)
aux_columns = df1[['Regional_indicator_consultado_Major','Regional_indicator_consultado']]
df1.drop(columns=['Country name','Regional indicator'], inplace=True) # removendo colunas auxiliares
df1
Out[147]:
Regional_indicator_consultado_Major Regional_indicator_consultado year Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Asia Southern Asia 2008 3.723590 7.370100 0.450662 50.799999 0.718114 0.167640 0.881686 0.517637 0.258195
1 Asia Southern Asia 2009 4.401778 7.539972 0.552308 51.200001 0.678896 0.190099 0.850035 0.583926 0.237092
2 Asia Southern Asia 2010 4.758381 7.646709 0.539075 51.599998 0.600127 0.120590 0.706766 0.618265 0.275324
3 Asia Southern Asia 2011 3.831719 7.619532 0.521104 51.919998 0.495901 0.162427 0.731109 0.611387 0.267175
4 Asia Southern Asia 2012 3.782938 7.705479 0.520637 52.240002 0.530935 0.236032 0.775620 0.710385 0.267919
... ... ... ... ... ... ... ... ... ... ... ... ...
2009 Africa Eastern Africa 2017 3.638300 8.015738 0.754147 55.000000 0.752826 -0.097645 0.751208 0.806428 0.224051
2010 Africa Eastern Africa 2018 3.616480 8.048798 0.775388 55.599998 0.762675 -0.068427 0.844209 0.710119 0.211726
2011 Africa Eastern Africa 2019 2.693523 7.950132 0.759162 56.200001 0.631908 -0.063791 0.830652 0.716004 0.235354
2012 Africa Eastern Africa 2020 3.159802 7.828757 0.717243 56.799999 0.643303 -0.008696 0.788523 0.702573 0.345736
2013 Africa Eastern Africa 2021 3.144800 7.942595 0.750470 56.200840 0.676700 -0.047346 0.820999 0.717712 0.224420

2014 rows × 12 columns

In [148]:
df1[df1['Ladder score'].isna()].shape
Out[148]:
(0, 12)
  • não há amostras com o atributo 'Ladder score' vazio
In [149]:
df1.Regional_indicator_consultado.value_counts()
Out[149]:
Western Asia                 225
Southern Europe              172
Western Africa               172
Eastern Africa               161
South America                157
Eastern Europe               146
Northern Europe              143
South-Eastern Asia           125
Central America              109
Southern Asia                106
Western Europe                99
Middle Africa                 69
Central Asia                  62
Northern Africa               61
Eastern Asia                  60
Southern Africa               45
Caribbean                     40
Northern America              32
Australia and New Zealand     30
Name: Regional_indicator_consultado, dtype: int64
  • desbalanceado

Extra Trees¶

SMOTE Extra Trees¶

In [150]:
cat = np.where( (df1.drop(columns='Ladder score').dtypes != float) & (df1.drop(columns='Ladder score').dtypes != 'int64') )[0]
In [151]:
smnc = SMOTENC(random_state=42, n_jobs=-1, categorical_features=cat)
X_res, y_res = smnc.fit_resample(df1.drop(columns=['Regional_indicator_consultado']), df1['Regional_indicator_consultado'])
In [152]:
X_res = pd.concat([X_res,y_res], axis=1)
df3 = pd.get_dummies(X_res, columns=['Regional_indicator_consultado_Major', 'Regional_indicator_consultado'], prefix='', prefix_sep='', sparse=False, dtype=bool)
In [153]:
X_res = pd.concat([X_res,y_res], axis=1)
df3 = pd.get_dummies(df1, columns=['Regional_indicator_consultado_Major', 'Regional_indicator_consultado'], prefix='', prefix_sep='', sparse=False, dtype=bool)
In [154]:
print(f"amostras antes: {df1.shape[0]}, amostras depois: {df3.shape[0]}")
amostras antes: 2014, amostras depois: 2014

Preparação e treino do modelo¶

In [155]:
X_train, X_test, y_train, y_test = train_test_split(df3.drop(columns=['Ladder score']), df3['Ladder score'], random_state=42, test_size=0.15)
In [156]:
reg = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
In [157]:
reg.score(X_test, y_test)
Out[157]:
0.920583271976963
In [158]:
reg.score(X_test, y_test)
Out[158]:
0.920583271976963

Resultados¶

In [159]:
y_pred_reg = reg.predict(X_test)
print(f"MAE: {round(mean_absolute_error(y_test, y_pred_reg),4)}\nR2: {round(r2_score(y_test, y_pred_reg),4)}\nExp. Variance: {round(explained_variance_score(y_test, y_pred_reg),4)}\
\nMax. Error: {round(max_error(y_test, y_pred_reg),4)}\nMSE: {round(mean_squared_error(y_test, y_pred_reg),4)}")
#print(classification_report(y_test, cb1.predict(X_test)))#, target_names=labels))
MAE: 0.2384
R2: 0.9206
Exp. Variance: 0.9212
Max. Error: 0.9347
MSE: 0.0991
In [160]:
y_pred_reg = reg.predict(X_test)
print(f"MAE: {round(mean_absolute_error(y_test, y_pred_reg),4)}\nR2: {round(r2_score(y_test, y_pred_reg),4)}\nExp. Variance: {round(explained_variance_score(y_test, y_pred_reg),4)}\
\nMax. Error: {round(max_error(y_test, y_pred_reg),4)}\nMSE: {round(mean_squared_error(y_test, y_pred_reg),4)}")
#print(classification_report(y_test, cb1.predict(X_test)))#, target_names=labels))
MAE: 0.2384
R2: 0.9206
Exp. Variance: 0.9212
Max. Error: 0.9347
MSE: 0.0991
In [161]:
reg_res = pd.DataFrame({'real':[*y_test], 'estimado':[*y_pred_reg]})
In [162]:
fig = go.Figure()
for column in reg_res:
    fig.add_trace(go.Scatter( x=reg_res.index, y=reg_res[column], name = column, mode = 'lines') )
fig.add_trace(go.Scatter( x=reg_res.index, y=(reg_res.real-reg_res.estimado), name='Diff', mode='lines') )
fig.update_layout(title = "Erro de predição", xaxis_title = 'Amostra')
In [163]:
joblib.dump(reg,'extra_trees_ladder_score.joblib')
Out[163]:
['extra_trees_ladder_score.joblib']
In [164]:
reg_res.index = range(reg_res.shape[0])
X_test.index = range(X_test.shape[0])
reg_res = pd.concat([reg_res,X_test], axis=1)
reg_res.to_csv('comparcao_direta_ls_extra_trees.csv')



CatBoost¶

SMOTE CatBoost¶

In [165]:
cat = np.where( (df1.drop(columns='Ladder score').dtypes != float) & (df1.drop(columns='Ladder score').dtypes != 'int64') )[0]
In [166]:
smnc = SMOTENC(random_state=42, n_jobs=-1, categorical_features=cat)
X_res, y_res = smnc.fit_resample(df1.drop(columns=['Regional_indicator_consultado']), df1['Regional_indicator_consultado'])
In [167]:
print(f"amostras antes: {df1.shape[0]}, amostras depois: {X_res.shape[0]}")
amostras antes: 2014, amostras depois: 4275

Preparação do modelo¶

In [168]:
X_res = pd.concat([X_res,y_res], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_res.drop(columns=['Ladder score']), X_res['Ladder score'], random_state=42, test_size=0.15)
In [169]:
cb = CatBoostRegressor(
    loss_function='RMSE',
    random_seed=42,
    logging_level='Silent',
    #iterations=150
)

Treino¶

In [170]:
cat = np.where( (X_res.drop(columns='Ladder score').dtypes != float) & (X_res.drop(columns='Ladder score').dtypes != 'int64') )[0]
In [171]:
cb.fit(
    X_train, y_train,
    cat_features=cat,
    eval_set=(X_test, y_test),
#     logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
In [172]:
cb.score(X_test,y_test)
Out[172]:
0.9511750848469631
In [173]:
cb.best_score_
Out[173]:
{'learn': {'RMSE': 0.12157044081061888},
 'validation': {'RMSE': 0.2483869872117922}}

Resultados¶

In [174]:
y_pred_reg = cb.predict(X_test)
print(f"MAE: {round(mean_absolute_error(y_test, y_pred_reg),4)}\nR2: {round(r2_score(y_test, y_pred_reg),4)}\nExp. Variance: {round(explained_variance_score(y_test, y_pred_reg),4)}\
\nMax. Error: {round(max_error(y_test, y_pred_reg),4)}\nMSE: {round(mean_squared_error(y_test, y_pred_reg),4)}")
#print(classification_report(y_test, cb1.predict(X_test)))#, target_names=labels))
MAE: 0.1702
R2: 0.9512
Exp. Variance: 0.9512
Max. Error: 1.3694
MSE: 0.0617
In [175]:
reg_res = pd.DataFrame({'real':[*y_test], 'estimado':[*y_pred_reg]})
In [176]:
fig = go.Figure()
for column in reg_res:
    fig.add_trace(go.Scatter( x=reg_res.index, y=reg_res[column], name = column, mode = 'lines') )
fig.add_trace(go.Scatter( x=reg_res.index, y=(reg_res.real-reg_res.estimado), name='erro', mode='lines') )
fig.update_layout(title = "Erro de predição", xaxis_title = 'Amostra')
In [177]:
joblib.dump(cb,'catboost_ladder_score.joblib')
Out[177]:
['catboost_ladder_score.joblib']
In [178]:
reg_res.index = range(reg_res.shape[0])
X_test.index = range(X_test.shape[0])
reg_res = pd.concat([reg_res,X_test], axis=1)
reg_res.to_csv('comparcao_direta_ls_catboost.csv')
In [ ]: